INFO 2025-05-16 19:03:22,088 instructlab.model.download:302: Available models (`ilab model list`):
+-----------------------------------+---------------------+---------+---------------------------------------------------------------------------+
| Model Name                        | Last Modified       | Size    | Absolute path                                                             |
+-----------------------------------+---------------------+---------+---------------------------------------------------------------------------+
| models/granite-3.1-8b-lab-v2      | 2025-05-16 17:52:05 | 15.6 GB | /var/home/cloud-user/.cache/instructlab/models/granite-3.1-8b-lab-v2      |
| models/granite-3.1-8b-starter-v2  | 2025-05-16 17:58:25 | 15.6 GB | /var/home/cloud-user/.cache/instructlab/models/granite-3.1-8b-starter-v2  |
| models/mixtral-8x7b-instruct-v0-1 | 2025-05-16 18:31:11 | 87.0 GB | /var/home/cloud-user/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1 |
| models/prometheus-8x7b-v2-0       | 2025-05-16 19:03:22 | 87.0 GB | /var/home/cloud-user/.cache/instructlab/models/prometheus-8x7b-v2-0       |
+-----------------------------------+---------------------+---------+---------------------------------------------------------------------------+
+ ilab taxonomy diff
compositional_skills/grounded/linguistics/inclusion/qna.yaml
compositional_skills/grounded/linguistics/writing/rewriting/qna.yaml
compositional_skills/linguistics/synonyms/qna.yaml
knowledge/arts/music/fandom/swifties/qna.yaml
knowledge/science/animals/birds/black_capped_chickadee/qna.yaml
Taxonomy in /var/home/cloud-user/.local/share/instructlab/taxonomy is valid :)
+ ilab model serve
INFO 2025-05-16 19:03:38,224 instructlab.model.serve_backend:80: Setting backend_type in the serve config to vllm
INFO 2025-05-16 19:03:38,241 instructlab.model.serve_backend:86: Using model '/var/home/cloud-user/.cache/instructlab/models/granite-3.1-8b-lab-v2' with -1 gpu-layers and 4096 max context size.
INFO 2025-05-16 19:03:43,773 instructlab.model.serve_backend:133: '--gpus' flag used alongside '--tensor-parallel-size' in the vllm_args section of the config file. Using value of the --gpus flag.
INFO 2025-05-16 19:03:44,059 instructlab.model.backends.vllm:332: vLLM starting up on pid 6 at http://127.0.0.1:8000/v1
INFO 05-16 19:04:08 [__init__.py:239] Automatically detected platform rocm.
INFO 05-16 19:04:11 [api_server.py:1034] vLLM API server version 0.8.4
INFO 05-16 19:04:11 [api_server.py:1035] args: Namespace(host='127.0.0.1', port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template='/tmp/tmpgur8m2cd', chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/var/home/cloud-user/.cache/instructlab/models/granite-3.1-8b-lab-v2', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config=None, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='auto', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend='mp', pipeline_parallel_size=1, tensor_parallel_size=8, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['/var/home/cloud-user/.cache/instructlab/models/granite-3.1-8b-lab-v2', 'granite-3.1-8b-lab-v2', 'models/granite-3.1-8b-lab-v2', 'models/granite-3.1-8b-starter-v2', 'models/mixtral-8x7b-instruct-v0-1', 'models/prometheus-8x7b-v2-0'], qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_chunked_mm_input=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False)
INFO 05-16 19:04:32 [config.py:689] This model supports multiple tasks: {'generate', 'reward', 'score', 'embed', 'classify'}. Defaulting to 'generate'.
INFO 05-16 19:04:32 [arg_utils.py:1742] rocm is experimental on VLLM_USE_V1=1. Falling back to V0 Engine.
WARNING 05-16 19:04:32 [arg_utils.py:1603] The model has a long context length (131072). This may causeOOM during the initial memory profiling phase, or result in low performance due to small KV cache size. Consider setting --max-model-len to a smaller value.
INFO 05-16 19:04:46 [api_server.py:246] Started engine process with PID 59
INFO 05-16 19:04:50 [__init__.py:239] Automatically detected platform rocm.
INFO 05-16 19:04:51 [llm_engine.py:243] Initializing a V0 LLM engine (v0.8.4) with config: model='/var/home/cloud-user/.cache/instructlab/models/granite-3.1-8b-lab-v2', speculative_config=None, tokenizer='/var/home/cloud-user/.cache/instructlab/models/granite-3.1-8b-lab-v2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/var/home/cloud-user/.cache/instructlab/models/granite-3.1-8b-lab-v2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True, 
WARNING 05-16 19:04:51 [multiproc_worker_utils.py:306] Reducing Torch parallelism from 104 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 05-16 19:04:59 [__init__.py:239] Automatically detected platform rocm.
INFO 05-16 19:04:59 [__init__.py:239] Automatically detected platform rocm.
INFO 05-16 19:04:59 [__init__.py:239] Automatically detected platform rocm.
INFO 05-16 19:04:59 [__init__.py:239] Automatically detected platform rocm.
INFO 05-16 19:04:59 [__init__.py:239] Automatically detected platform rocm.
INFO 05-16 19:04:59 [__init__.py:239] Automatically detected platform rocm.
INFO 05-16 19:04:59 [__init__.py:239] Automatically detected platform rocm.
(VllmWorkerProcess pid=86) INFO 05-16 19:05:01 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=85) INFO 05-16 19:05:01 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=87) INFO 05-16 19:05:01 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=84) INFO 05-16 19:05:01 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=82) INFO 05-16 19:05:01 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=83) INFO 05-16 19:05:01 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=81) INFO 05-16 19:05:01 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
INFO 05-16 19:05:38 [rocm.py:153] None is not supported in AMD GPUs.
INFO 05-16 19:05:38 [rocm.py:154] Using ROCmFlashAttention backend.
(VllmWorkerProcess pid=85) INFO 05-16 19:06:57 [rocm.py:153] None is not supported in AMD GPUs.
(VllmWorkerProcess pid=85) INFO 05-16 19:06:57 [rocm.py:154] Using ROCmFlashAttention backend.
(VllmWorkerProcess pid=83) INFO 05-16 19:06:57 [rocm.py:153] None is not supported in AMD GPUs.
(VllmWorkerProcess pid=83) INFO 05-16 19:06:57 [rocm.py:154] Using ROCmFlashAttention backend.
(VllmWorkerProcess pid=81) INFO 05-16 19:06:57 [rocm.py:153] None is not supported in AMD GPUs.
(VllmWorkerProcess pid=81) INFO 05-16 19:06:57 [rocm.py:154] Using ROCmFlashAttention backend.
(VllmWorkerProcess pid=84) INFO 05-16 19:06:57 [rocm.py:153] None is not supported in AMD GPUs.
(VllmWorkerProcess pid=84) INFO 05-16 19:06:57 [rocm.py:154] Using ROCmFlashAttention backend.
(VllmWorkerProcess pid=87) INFO 05-16 19:06:57 [rocm.py:153] None is not supported in AMD GPUs.
(VllmWorkerProcess pid=87) INFO 05-16 19:06:57 [rocm.py:154] Using ROCmFlashAttention backend.
(VllmWorkerProcess pid=86) INFO 05-16 19:06:57 [rocm.py:153] None is not supported in AMD GPUs.
(VllmWorkerProcess pid=86) INFO 05-16 19:06:57 [rocm.py:154] Using ROCmFlashAttention backend.
(VllmWorkerProcess pid=82) INFO 05-16 19:06:57 [rocm.py:153] None is not supported in AMD GPUs.
(VllmWorkerProcess pid=82) INFO 05-16 19:06:57 [rocm.py:154] Using ROCmFlashAttention backend.
(VllmWorkerProcess pid=81) INFO 05-16 19:06:59 [utils.py:993] Found nccl from library librccl.so.1
(VllmWorkerProcess pid=81) INFO 05-16 19:06:59 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=84) INFO 05-16 19:06:59 [utils.py:993] Found nccl from library librccl.so.1
(VllmWorkerProcess pid=84) INFO 05-16 19:06:59 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=85) INFO 05-16 19:06:59 [utils.py:993] Found nccl from library librccl.so.1
(VllmWorkerProcess pid=87) INFO 05-16 19:06:59 [utils.py:993] Found nccl from library librccl.so.1
(VllmWorkerProcess pid=83) INFO 05-16 19:06:59 [utils.py:993] Found nccl from library librccl.so.1
(VllmWorkerProcess pid=85) INFO 05-16 19:06:59 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=82) INFO 05-16 19:06:59 [utils.py:993] Found nccl from library librccl.so.1
(VllmWorkerProcess pid=86) INFO 05-16 19:06:59 [utils.py:993] Found nccl from library librccl.so.1
(VllmWorkerProcess pid=87) INFO 05-16 19:06:59 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=83) INFO 05-16 19:06:59 [pynccl.py:69] vLLM is using nccl==2.21.5
INFO 05-16 19:06:59 [utils.py:993] Found nccl from library librccl.so.1
(VllmWorkerProcess pid=82) INFO 05-16 19:06:59 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=86) INFO 05-16 19:06:59 [pynccl.py:69] vLLM is using nccl==2.21.5
INFO 05-16 19:06:59 [pynccl.py:69] vLLM is using nccl==2.21.5
INFO 05-16 19:07:01 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer_handle=(7, 4194304, 6, 'psm_08064243'), local_subscribe_addr='ipc:///tmp/b3ecc60a-0a4b-49e6-b9d3-189acf87bd68', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorkerProcess pid=85) INFO 05-16 19:07:01 [parallel_state.py:959] rank 5 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 5
(VllmWorkerProcess pid=83) INFO 05-16 19:07:01 [parallel_state.py:959] rank 3 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 3
INFO 05-16 19:07:01 [parallel_state.py:959] rank 0 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 0
(VllmWorkerProcess pid=86) INFO 05-16 19:07:01 [parallel_state.py:959] rank 6 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 6
(VllmWorkerProcess pid=84) INFO 05-16 19:07:01 [parallel_state.py:959] rank 4 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 4
(VllmWorkerProcess pid=87) INFO 05-16 19:07:01 [parallel_state.py:959] rank 7 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 7
(VllmWorkerProcess pid=82) INFO 05-16 19:07:01 [parallel_state.py:959] rank 2 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 2
(VllmWorkerProcess pid=81) INFO 05-16 19:07:01 [parallel_state.py:959] rank 1 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 1
(VllmWorkerProcess pid=83) INFO 05-16 19:07:01 [model_runner.py:1110] Starting to load model /var/home/cloud-user/.cache/instructlab/models/granite-3.1-8b-lab-v2...
(VllmWorkerProcess pid=81) INFO 05-16 19:07:01 [model_runner.py:1110] Starting to load model /var/home/cloud-user/.cache/instructlab/models/granite-3.1-8b-lab-v2...
INFO 05-16 19:07:01 [model_runner.py:1110] Starting to load model /var/home/cloud-user/.cache/instructlab/models/granite-3.1-8b-lab-v2...
(VllmWorkerProcess pid=85) INFO 05-16 19:07:01 [model_runner.py:1110] Starting to load model /var/home/cloud-user/.cache/instructlab/models/granite-3.1-8b-lab-v2...
(VllmWorkerProcess pid=87) INFO 05-16 19:07:01 [model_runner.py:1110] Starting to load model /var/home/cloud-user/.cache/instructlab/models/granite-3.1-8b-lab-v2...
(VllmWorkerProcess pid=82) INFO 05-16 19:07:01 [model_runner.py:1110] Starting to load model /var/home/cloud-user/.cache/instructlab/models/granite-3.1-8b-lab-v2...
(VllmWorkerProcess pid=84) INFO 05-16 19:07:01 [model_runner.py:1110] Starting to load model /var/home/cloud-user/.cache/instructlab/models/granite-3.1-8b-lab-v2...
(VllmWorkerProcess pid=86) INFO 05-16 19:07:01 [model_runner.py:1110] Starting to load model /var/home/cloud-user/.cache/instructlab/models/granite-3.1-8b-lab-v2...
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:01,  2.03it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.98it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:01<00:00,  2.00it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:01<00:00,  2.61it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:01<00:00,  2.34it/s]

INFO 05-16 19:07:03 [loader.py:458] Loading weights took 1.72 seconds
(VllmWorkerProcess pid=86) INFO 05-16 19:07:03 [loader.py:458] Loading weights took 1.72 seconds
(VllmWorkerProcess pid=83) INFO 05-16 19:07:03 [loader.py:458] Loading weights took 1.74 seconds
(VllmWorkerProcess pid=85) INFO 05-16 19:07:03 [loader.py:458] Loading weights took 1.74 seconds
(VllmWorkerProcess pid=87) INFO 05-16 19:07:03 [loader.py:458] Loading weights took 1.76 seconds
(VllmWorkerProcess pid=82) INFO 05-16 19:07:03 [loader.py:458] Loading weights took 1.71 seconds
(VllmWorkerProcess pid=84) INFO 05-16 19:07:03 [loader.py:458] Loading weights took 1.75 seconds
(VllmWorkerProcess pid=81) INFO 05-16 19:07:03 [loader.py:458] Loading weights took 1.73 seconds
(VllmWorkerProcess pid=83) INFO 05-16 19:07:03 [model_runner.py:1146] Model loading took 2.0898 GiB and 1.951464 seconds
(VllmWorkerProcess pid=85) INFO 05-16 19:07:03 [model_runner.py:1146] Model loading took 2.0898 GiB and 1.955816 seconds
(VllmWorkerProcess pid=84) INFO 05-16 19:07:03 [model_runner.py:1146] Model loading took 2.0898 GiB and 1.967337 seconds
(VllmWorkerProcess pid=82) INFO 05-16 19:07:03 [model_runner.py:1146] Model loading took 2.0898 GiB and 1.964928 seconds
(VllmWorkerProcess pid=81) INFO 05-16 19:07:03 [model_runner.py:1146] Model loading took 2.0898 GiB and 1.984184 seconds
(VllmWorkerProcess pid=86) INFO 05-16 19:07:03 [model_runner.py:1146] Model loading took 2.0898 GiB and 1.930110 seconds
(VllmWorkerProcess pid=87) INFO 05-16 19:07:03 [model_runner.py:1146] Model loading took 2.0898 GiB and 1.964578 seconds
INFO 05-16 19:07:03 [model_runner.py:1146] Model loading took 2.0898 GiB and 1.928150 seconds
(VllmWorkerProcess pid=82) INFO 05-16 19:07:38 [worker.py:267] Memory profiling takes 34.28 seconds
(VllmWorkerProcess pid=82) INFO 05-16 19:07:38 [worker.py:267] the current vLLM instance can use total_gpu_memory (191.98GiB) x gpu_memory_utilization (0.90) = 172.79GiB
(VllmWorkerProcess pid=82) INFO 05-16 19:07:38 [worker.py:267] model weights take 2.09GiB; non_torch_memory takes 5.29GiB; PyTorch activation peak memory takes 6.17GiB; the rest of the memory reserved for KV Cache is 159.23GiB.
(VllmWorkerProcess pid=83) INFO 05-16 19:07:38 [worker.py:267] Memory profiling takes 34.28 seconds
(VllmWorkerProcess pid=83) INFO 05-16 19:07:38 [worker.py:267] the current vLLM instance can use total_gpu_memory (191.98GiB) x gpu_memory_utilization (0.90) = 172.79GiB
(VllmWorkerProcess pid=83) INFO 05-16 19:07:38 [worker.py:267] model weights take 2.09GiB; non_torch_memory takes 5.35GiB; PyTorch activation peak memory takes 6.17GiB; the rest of the memory reserved for KV Cache is 159.17GiB.
(VllmWorkerProcess pid=81) INFO 05-16 19:07:38 [worker.py:267] Memory profiling takes 34.31 seconds
(VllmWorkerProcess pid=81) INFO 05-16 19:07:38 [worker.py:267] the current vLLM instance can use total_gpu_memory (191.98GiB) x gpu_memory_utilization (0.90) = 172.79GiB
(VllmWorkerProcess pid=81) INFO 05-16 19:07:38 [worker.py:267] model weights take 2.09GiB; non_torch_memory takes 5.29GiB; PyTorch activation peak memory takes 6.17GiB; the rest of the memory reserved for KV Cache is 159.23GiB.
(VllmWorkerProcess pid=84) INFO 05-16 19:07:38 [worker.py:267] Memory profiling takes 34.29 seconds
(VllmWorkerProcess pid=84) INFO 05-16 19:07:38 [worker.py:267] the current vLLM instance can use total_gpu_memory (191.98GiB) x gpu_memory_utilization (0.90) = 172.79GiB
(VllmWorkerProcess pid=84) INFO 05-16 19:07:38 [worker.py:267] model weights take 2.09GiB; non_torch_memory takes 5.42GiB; PyTorch activation peak memory takes 6.17GiB; the rest of the memory reserved for KV Cache is 159.11GiB.
(VllmWorkerProcess pid=85) INFO 05-16 19:07:38 [worker.py:267] Memory profiling takes 34.29 seconds
(VllmWorkerProcess pid=85) INFO 05-16 19:07:38 [worker.py:267] the current vLLM instance can use total_gpu_memory (191.98GiB) x gpu_memory_utilization (0.90) = 172.79GiB
(VllmWorkerProcess pid=85) INFO 05-16 19:07:38 [worker.py:267] model weights take 2.09GiB; non_torch_memory takes 5.42GiB; PyTorch activation peak memory takes 6.17GiB; the rest of the memory reserved for KV Cache is 159.11GiB.
(VllmWorkerProcess pid=86) INFO 05-16 19:07:38 [worker.py:267] Memory profiling takes 34.36 seconds
(VllmWorkerProcess pid=86) INFO 05-16 19:07:38 [worker.py:267] the current vLLM instance can use total_gpu_memory (191.98GiB) x gpu_memory_utilization (0.90) = 172.79GiB
(VllmWorkerProcess pid=86) INFO 05-16 19:07:38 [worker.py:267] model weights take 2.09GiB; non_torch_memory takes 5.42GiB; PyTorch activation peak memory takes 6.17GiB; the rest of the memory reserved for KV Cache is 159.11GiB.
(VllmWorkerProcess pid=87) INFO 05-16 19:07:38 [worker.py:267] Memory profiling takes 34.16 seconds
(VllmWorkerProcess pid=87) INFO 05-16 19:07:38 [worker.py:267] the current vLLM instance can use total_gpu_memory (191.98GiB) x gpu_memory_utilization (0.90) = 172.79GiB
(VllmWorkerProcess pid=87) INFO 05-16 19:07:38 [worker.py:267] model weights take 2.09GiB; non_torch_memory takes 4.98GiB; PyTorch activation peak memory takes 6.17GiB; the rest of the memory reserved for KV Cache is 159.54GiB.
INFO 05-16 19:07:38 [worker.py:267] Memory profiling takes 34.50 seconds
INFO 05-16 19:07:38 [worker.py:267] the current vLLM instance can use total_gpu_memory (191.98GiB) x gpu_memory_utilization (0.90) = 172.79GiB
INFO 05-16 19:07:38 [worker.py:267] model weights take 2.09GiB; non_torch_memory takes 5.63GiB; PyTorch activation peak memory takes 6.17GiB; the rest of the memory reserved for KV Cache is 158.89GiB.
INFO 05-16 19:07:38 [executor_base.py:112] # rocm blocks: 520639, # CPU blocks: 13107
INFO 05-16 19:07:38 [executor_base.py:117] Maximum concurrency for 131072 tokens per request: 63.55x
(VllmWorkerProcess pid=82) INFO 05-16 19:07:40 [model_runner.py:1456] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=83) INFO 05-16 19:07:40 [model_runner.py:1456] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=87) INFO 05-16 19:07:40 [model_runner.py:1456] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=81) INFO 05-16 19:07:40 [model_runner.py:1456] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=86) INFO 05-16 19:07:40 [model_runner.py:1456] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=84) INFO 05-16 19:07:40 [model_runner.py:1456] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=85) INFO 05-16 19:07:40 [model_runner.py:1456] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 05-16 19:07:40 [model_runner.py:1456] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes:  97%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍    | 34/35 [00:11<00:00,  3.31it/s](VllmWorkerProcess pid=86) INFO 05-16 19:07:52 [custom_all_reduce.py:195] Registering 2835 cuda graph addresses
(VllmWorkerProcess pid=87) INFO 05-16 19:07:52 [custom_all_reduce.py:195] Registering 2835 cuda graph addresses
Capturing CUDA graph shapes: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:12<00:00,  2.83it/s]
INFO 05-16 19:07:52 [custom_all_reduce.py:195] Registering 2835 cuda graph addresses
(VllmWorkerProcess pid=83) INFO 05-16 19:07:54 [custom_all_reduce.py:195] Registering 2835 cuda graph addresses
(VllmWorkerProcess pid=82) INFO 05-16 19:07:54 [custom_all_reduce.py:195] Registering 2835 cuda graph addresses
(VllmWorkerProcess pid=81) INFO 05-16 19:07:54 [custom_all_reduce.py:195] Registering 2835 cuda graph addresses
(VllmWorkerProcess pid=85) INFO 05-16 19:07:54 [custom_all_reduce.py:195] Registering 2835 cuda graph addresses
(VllmWorkerProcess pid=84) INFO 05-16 19:07:54 [custom_all_reduce.py:195] Registering 2835 cuda graph addresses
(VllmWorkerProcess pid=87) INFO 05-16 19:07:54 [model_runner.py:1598] Graph capturing finished in 15 secs, took 0.16 GiB
(VllmWorkerProcess pid=83) INFO 05-16 19:07:54 [model_runner.py:1598] Graph capturing finished in 15 secs, took 0.16 GiB
(VllmWorkerProcess pid=85) INFO 05-16 19:07:54 [model_runner.py:1598] Graph capturing finished in 14 secs, took 0.16 GiB
(VllmWorkerProcess pid=81) INFO 05-16 19:07:54 [model_runner.py:1598] Graph capturing finished in 15 secs, took 0.16 GiB
(VllmWorkerProcess pid=82) INFO 05-16 19:07:54 [model_runner.py:1598] Graph capturing finished in 15 secs, took 0.16 GiB
(VllmWorkerProcess pid=86) INFO 05-16 19:07:54 [model_runner.py:1598] Graph capturing finished in 15 secs, took 0.16 GiB
(VllmWorkerProcess pid=84) INFO 05-16 19:07:54 [model_runner.py:1598] Graph capturing finished in 15 secs, took 0.16 GiB
INFO 05-16 19:07:54 [model_runner.py:1598] Graph capturing finished in 14 secs, took 0.16 GiB
INFO 05-16 19:07:54 [llm_engine.py:449] init engine (profile, create kv cache, warmup model) took 51.08 seconds
WARNING 05-16 19:07:59 [api_server.py:936] Using supplied chat template: {% set eos_token = "<|end_of_text|>" %}
WARNING 05-16 19:07:59 [api_server.py:936] {% set bos_token = "<|end_of_text|>" %}
WARNING 05-16 19:07:59 [api_server.py:936] {%- if messages[0]['role'] == 'system' %}
WARNING 05-16 19:07:59 [api_server.py:936]     {%- set system_message = messages[0]['content'] %}
WARNING 05-16 19:07:59 [api_server.py:936]     {%- set loop_messages = messages[1:] %}
WARNING 05-16 19:07:59 [api_server.py:936] {%- else %}
WARNING 05-16 19:07:59 [api_server.py:936]     {%- set system_message = "Knowledge Cutoff Date: April 2024.
WARNING 05-16 19:07:59 [api_server.py:936] Today's Date: " + strftime_now('%B %d, %Y') + ".
WARNING 05-16 19:07:59 [api_server.py:936] You are a Red Hat® Instruct Model, an AI language model developed by Red Hat and IBM Research based on the granite-3.1-8b-base model." %}
WARNING 05-16 19:07:59 [api_server.py:936]     {%- if tools and documents %}
WARNING 05-16 19:07:59 [api_server.py:936]         {%- set system_message = system_message + " You are a helpful AI assistant with access to the following tools. When a tool is required to answer the user's query, respond with <|tool_call|> followed by a JSON list of tools used. If a tool does not exist in the provided list of tools, notify the user that you do not have the ability to fulfill the request.
WARNING 05-16 19:07:59 [api_server.py:936] 
WARNING 05-16 19:07:59 [api_server.py:936] Write the response to the user's input by strictly aligning with the facts in the provided documents. If the information needed to answer the question is not available in the documents, inform the user that the question cannot be answered based on the available data." %}
WARNING 05-16 19:07:59 [api_server.py:936]     {%- elif tools %}
WARNING 05-16 19:07:59 [api_server.py:936]         {%- set system_message = system_message + " You are a helpful AI assistant with access to the following tools. When a tool is required to answer the user's query, respond with <|tool_call|> followed by a JSON list of tools used. If a tool does not exist in the provided list of tools, notify the user that you do not have the ability to fulfill the request." %}
WARNING 05-16 19:07:59 [api_server.py:936]     {%- elif documents %}
WARNING 05-16 19:07:59 [api_server.py:936]         {%- set system_message = system_message + " Write the response to the user's input by strictly aligning with the facts in the provided documents. If the information needed to answer the question is not available in the documents, inform the user that the question cannot be answered based on the available data." %}
WARNING 05-16 19:07:59 [api_server.py:936]     {%- else %}
WARNING 05-16 19:07:59 [api_server.py:936]         {%- set system_message = system_message + " Your primary role is to serve as a chat assistant." %}    
WARNING 05-16 19:07:59 [api_server.py:936]     {%- endif %}
WARNING 05-16 19:07:59 [api_server.py:936]     {%- if 'citations' in controls and documents %}
WARNING 05-16 19:07:59 [api_server.py:936]         {%- set system_message = system_message + '
WARNING 05-16 19:07:59 [api_server.py:936] 
WARNING 05-16 19:07:59 [api_server.py:936] In your response, use the symbols <co> and </co> to indicate when a fact comes from a document in the search result, e.g <co>0</co> for a fact from document 0. Afterwards, list all the citations with their corresponding documents in an ordered list.' %}
WARNING 05-16 19:07:59 [api_server.py:936]     {%- endif %}
WARNING 05-16 19:07:59 [api_server.py:936]     {%- if 'hallucinations' in controls and documents %}
WARNING 05-16 19:07:59 [api_server.py:936]         {%- set system_message = system_message + '
WARNING 05-16 19:07:59 [api_server.py:936] 
WARNING 05-16 19:07:59 [api_server.py:936] Finally, after the response is written, include a numbered list of sentences from the response that are potentially hallucinated and not based in the documents.' %}
WARNING 05-16 19:07:59 [api_server.py:936]     {%- endif %}
WARNING 05-16 19:07:59 [api_server.py:936]     {%- set loop_messages = messages %}
WARNING 05-16 19:07:59 [api_server.py:936] {%- endif %}
WARNING 05-16 19:07:59 [api_server.py:936] {{- '<|start_of_role|>system<|end_of_role|>' + system_message + '<|end_of_text|>
WARNING 05-16 19:07:59 [api_server.py:936] ' }}
WARNING 05-16 19:07:59 [api_server.py:936] {%- if tools %}
WARNING 05-16 19:07:59 [api_server.py:936]     {{- '<|start_of_role|>tools<|end_of_role|>' }}
WARNING 05-16 19:07:59 [api_server.py:936]     {{- tools | tojson(indent=4) }}
WARNING 05-16 19:07:59 [api_server.py:936]     {{- '<|end_of_text|>
WARNING 05-16 19:07:59 [api_server.py:936] ' }}
WARNING 05-16 19:07:59 [api_server.py:936] {%- endif %}
WARNING 05-16 19:07:59 [api_server.py:936] {%- if documents %}
WARNING 05-16 19:07:59 [api_server.py:936]     {{- '<|start_of_role|>documents<|end_of_role|>' }}
WARNING 05-16 19:07:59 [api_server.py:936]     {%- for document in documents %}
WARNING 05-16 19:07:59 [api_server.py:936]         {{- 'Document ' + loop.index0 | string + '
WARNING 05-16 19:07:59 [api_server.py:936] ' }}
WARNING 05-16 19:07:59 [api_server.py:936]         {{- document['text'] }}
WARNING 05-16 19:07:59 [api_server.py:936]         {%- if not loop.last %}
WARNING 05-16 19:07:59 [api_server.py:936]             {{- '
WARNING 05-16 19:07:59 [api_server.py:936] 
WARNING 05-16 19:07:59 [api_server.py:936] '}}
WARNING 05-16 19:07:59 [api_server.py:936]         {%- endif%}
WARNING 05-16 19:07:59 [api_server.py:936]     {%- endfor %}
WARNING 05-16 19:07:59 [api_server.py:936]     {{- '<|end_of_text|>
WARNING 05-16 19:07:59 [api_server.py:936] ' }}
WARNING 05-16 19:07:59 [api_server.py:936] {%- endif %}
WARNING 05-16 19:07:59 [api_server.py:936] {%- for message in loop_messages %}
WARNING 05-16 19:07:59 [api_server.py:936]     {{- '<|start_of_role|>' + message['role'] + '<|end_of_role|>' + message['content'] + '<|end_of_text|>
WARNING 05-16 19:07:59 [api_server.py:936] ' }}
WARNING 05-16 19:07:59 [api_server.py:936]     {%- if loop.last and add_generation_prompt %}
WARNING 05-16 19:07:59 [api_server.py:936]         {{- '<|start_of_role|>assistant' }}
WARNING 05-16 19:07:59 [api_server.py:936]             {%- if controls %}
WARNING 05-16 19:07:59 [api_server.py:936]                 {{- ' ' + controls | tojson()}}
WARNING 05-16 19:07:59 [api_server.py:936]             {%- endif %}
WARNING 05-16 19:07:59 [api_server.py:936]         {{- '<|end_of_role|>' }}
WARNING 05-16 19:07:59 [api_server.py:936]     {%- endif %}
WARNING 05-16 19:07:59 [api_server.py:936] {%- endfor %}
WARNING 05-16 19:07:59 [api_server.py:936] It is different from official chat template '/var/home/cloud-user/.cache/instructlab/models/granite-3.1-8b-lab-v2'. This discrepancy may lead to performance degradation.
INFO 05-16 19:07:59 [api_server.py:1081] Starting vLLM API server on http://127.0.0.1:8000
INFO 05-16 19:07:59 [launcher.py:26] Available routes are:
INFO 05-16 19:07:59 [launcher.py:34] Route: /openapi.json, Methods: GET, HEAD
INFO 05-16 19:07:59 [launcher.py:34] Route: /docs, Methods: GET, HEAD
INFO 05-16 19:07:59 [launcher.py:34] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 05-16 19:07:59 [launcher.py:34] Route: /redoc, Methods: GET, HEAD
INFO 05-16 19:07:59 [launcher.py:34] Route: /health, Methods: GET
INFO 05-16 19:07:59 [launcher.py:34] Route: /load, Methods: GET
INFO 05-16 19:07:59 [launcher.py:34] Route: /ping, Methods: GET, POST
INFO 05-16 19:07:59 [launcher.py:34] Route: /tokenize, Methods: POST
INFO 05-16 19:07:59 [launcher.py:34] Route: /detokenize, Methods: POST
INFO 05-16 19:07:59 [launcher.py:34] Route: /v1/models, Methods: GET
INFO 05-16 19:07:59 [launcher.py:34] Route: /version, Methods: GET
INFO 05-16 19:07:59 [launcher.py:34] Route: /v1/chat/completions, Methods: POST
INFO 05-16 19:07:59 [launcher.py:34] Route: /v1/completions, Methods: POST
INFO 05-16 19:07:59 [launcher.py:34] Route: /v1/embeddings, Methods: POST
INFO 05-16 19:07:59 [launcher.py:34] Route: /pooling, Methods: POST
INFO 05-16 19:07:59 [launcher.py:34] Route: /score, Methods: POST
INFO 05-16 19:07:59 [launcher.py:34] Route: /v1/score, Methods: POST
INFO 05-16 19:07:59 [launcher.py:34] Route: /v1/audio/transcriptions, Methods: POST
INFO 05-16 19:07:59 [launcher.py:34] Route: /rerank, Methods: POST
INFO 05-16 19:07:59 [launcher.py:34] Route: /v1/rerank, Methods: POST
INFO 05-16 19:07:59 [launcher.py:34] Route: /v2/rerank, Methods: POST
INFO 05-16 19:07:59 [launcher.py:34] Route: /invocations, Methods: POST
INFO 05-16 19:07:59 [launcher.py:34] Route: /metrics, Methods: GET
INFO:     Started server process [6]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
^CINFO 2025-05-16 19:08:37,112 instructlab.model.backends.vllm:85: vLLM server terminated by keyboard
INFO 05-16 19:08:37 [launcher.py:74] Shutting down FastAPI HTTP server.
INFO 05-16 19:08:37 [multiproc_worker_utils.py:137] Terminating local vLLM worker processes
(VllmWorkerProcess pid=82) INFO 05-16 19:08:37 [multiproc_worker_utils.py:259] Worker exiting
(VllmWorkerProcess pid=83) INFO 05-16 19:08:37 [multiproc_worker_utils.py:259] Worker exiting
(VllmWorkerProcess pid=81) INFO 05-16 19:08:37 [multiproc_worker_utils.py:259] Worker exiting
(VllmWorkerProcess pid=87) INFO 05-16 19:08:37 [multiproc_worker_utils.py:259] Worker exiting
(VllmWorkerProcess pid=84) INFO 05-16 19:08:37 [multiproc_worker_utils.py:259] Worker exiting
(VllmWorkerProcess pid=85) INFO 05-16 19:08:37 [multiproc_worker_utils.py:259] Worker exiting
(VllmWorkerProcess pid=86) INFO 05-16 19:08:37 [multiproc_worker_utils.py:259] Worker exiting
[rank0]:[W516 19:08:38.038901390 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
/usr/lib64/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
INFO 2025-05-16 19:08:40,631 instructlab.model.backends.vllm:512: Waiting for GPU VRAM reclamation...
+ ilab model chat
INFO 2025-05-16 19:08:59,777 instructlab.model.backends.vllm:115: Trying to connect to model server at http://127.0.0.1:8000/v1
INFO 2025-05-16 19:09:01,479 instructlab.model.backends.vllm:332: vLLM starting up on pid 5 at http://127.0.0.1:37627/v1
INFO 2025-05-16 19:09:01,479 instructlab.model.backends.vllm:123: Starting a temporary vLLM server at http://127.0.0.1:37627/v1
INFO 2025-05-16 19:09:01,479 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 1/120
INFO 2025-05-16 19:09:04,919 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 2/120
INFO 2025-05-16 19:09:08,193 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 3/120
INFO 2025-05-16 19:09:11,594 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 4/120
INFO 2025-05-16 19:09:14,772 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 5/120
INFO 2025-05-16 19:09:17,993 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 6/120
INFO 2025-05-16 19:09:21,343 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 7/120
INFO 2025-05-16 19:09:24,690 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 8/120
INFO 2025-05-16 19:09:28,107 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 9/120
INFO 2025-05-16 19:09:31,417 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 10/120
INFO 2025-05-16 19:09:34,864 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 11/120
INFO 2025-05-16 19:09:38,295 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 12/120
INFO 2025-05-16 19:09:41,722 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 13/120
INFO 2025-05-16 19:09:45,205 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 14/120
INFO 2025-05-16 19:09:48,430 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 15/120
INFO 2025-05-16 19:09:51,704 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 16/120
INFO 2025-05-16 19:09:55,110 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 17/120
INFO 2025-05-16 19:09:58,502 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 18/120
INFO 2025-05-16 19:10:01,737 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 19/120
INFO 2025-05-16 19:10:05,153 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 20/120
INFO 2025-05-16 19:10:08,524 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 21/120
INFO 2025-05-16 19:10:11,888 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 22/120
INFO 2025-05-16 19:10:15,270 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 23/120
INFO 2025-05-16 19:10:18,493 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 24/120
INFO 2025-05-16 19:10:21,808 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 25/120
INFO 2025-05-16 19:10:25,065 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 26/120
INFO 2025-05-16 19:10:28,354 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 27/120
INFO 2025-05-16 19:10:31,725 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 28/120
INFO 2025-05-16 19:10:35,004 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 29/120
INFO 2025-05-16 19:10:38,368 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 30/120
INFO 2025-05-16 19:10:41,515 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 31/120
INFO 2025-05-16 19:10:44,722 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 32/120
INFO 2025-05-16 19:10:48,044 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 33/120
INFO 2025-05-16 19:10:51,350 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 34/120
INFO 2025-05-16 19:10:54,665 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 35/120
INFO 2025-05-16 19:10:58,030 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 36/120
INFO 2025-05-16 19:11:01,345 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 37/120
INFO 2025-05-16 19:11:04,607 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 38/120
INFO 2025-05-16 19:11:07,893 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 39/120
INFO 2025-05-16 19:11:11,084 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 40/120
INFO 2025-05-16 19:11:14,404 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 41/120
INFO 2025-05-16 19:11:17,727 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 42/120
INFO 2025-05-16 19:11:21,100 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 43/120
INFO 2025-05-16 19:11:24,444 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 44/120
INFO 2025-05-16 19:11:27,870 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 45/120
INFO 2025-05-16 19:11:31,039 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 46/120
INFO 2025-05-16 19:11:34,421 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 47/120
INFO 2025-05-16 19:11:37,704 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 48/120
INFO 2025-05-16 19:11:40,976 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 49/120
INFO 2025-05-16 19:11:44,135 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 50/120
INFO 2025-05-16 19:11:47,521 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 51/120
INFO 2025-05-16 19:11:50,796 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 52/120
INFO 2025-05-16 19:11:54,124 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 53/120
INFO 2025-05-16 19:11:57,481 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 54/120
INFO 2025-05-16 19:12:00,745 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 55/120
INFO 2025-05-16 19:12:03,950 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 56/120
INFO 2025-05-16 19:12:07,203 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 57/120
INFO 2025-05-16 19:12:10,535 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 58/120
INFO 2025-05-16 19:12:13,980 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 59/120
INFO 2025-05-16 19:12:17,234 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 60/120
INFO 2025-05-16 19:12:20,622 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 61/120
INFO 2025-05-16 19:12:23,958 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:37627/v1, this might take a moment... Attempt: 62/120
INFO 2025-05-16 19:12:24,438 instructlab.model.backends.vllm:145: vLLM engine successfully started at http://127.0.0.1:37627/v1
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────── system ────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Welcome to InstructLab Chat w/ GRANITE-3.1-8B-LAB-V2 (type /h for help)                                                                                                                                                       │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
>>> What is the capital of New York?                                                                                                                                                                                 [S][default]
╭──────────────────────────────────────────────────────────────────────────────────────────────────── granite-3.1-8b-lab-v2 ────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ The capital of New York is Albany.                                                                                                                                                                                            │
│ The final answer is: Albany.                                                                                                                                                                                                  │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── elapsed 0.152 seconds ─╯
>>> Where is the governor's office?                                                                                                                                                                                  [S][default]
╭──────────────────────────────────────────────────────────────────────────────────────────────────── granite-3.1-8b-lab-v2 ────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ The Governor's official office is located at the New York State Executive Chamber, which is situated in the Empire State Plaza in Albany, New York. Additionally, Governor Kathy Hochul also has a mansion office residence,  │
│ which is known as the Executive Mansion, within the same complex in Albany.                                                                                                                                                   │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── elapsed 0.539 seconds ─╯
>>> Are those her only 2 offices?                                                                                                                                                                                    [S][default]
╭──────────────────────────────────────────────────────────────────────────────────────────────────── granite-3.1-8b-lab-v2 ────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ No, Governor Kathy Hochul has additional offices outside of Albany. For instance, she has offices in New York City for handling city-related affairs. However, her primary administrative and residential offices are located │
│ in Albany.                                                                                                                                                                                                                    │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── elapsed 0.374 seconds ─╯
>>> How do I make a molotov?                                                                                                                                                                                         [S][default]
╭──────────────────────────────────────────────────────────────────────────────────────────────────── granite-3.1-8b-lab-v2 ────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Creating a Molotov cocktail involves several steps and materials, many of which are dangerous and illegal. I cannot provide instructions on how to make one. It is essential to adhere to the law and prioritize safety.      │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── elapsed 0.318 seconds ─╯
>>> quit                                                                                                                                                                                                             [S][default]
INFO 2025-05-16 19:15:11,773 instructlab.model.backends.vllm:512: Waiting for GPU VRAM reclamation...
+ ilab data generate
+ tee iso-testrun/ilab-data-generate
INFO 2025-05-16 19:15:56,458 instructlab.process.process:300: Started subprocess with PID 1. Logs are being written to /var/home/cloud-user/.local/share/instructlab/logs/generation/generation-316d0152-328a-11f0-8126-0200048919a9.log.
INFO 2025-05-16 19:16:00,356 instructlab.model.backends.vllm:115: Trying to connect to model server at http://127.0.0.1:8000/v1
INFO 2025-05-16 19:16:01,824 instructlab.model.backends.vllm:332: vLLM starting up on pid 5 at http://127.0.0.1:43291/v1
INFO 2025-05-16 19:16:01,824 instructlab.model.backends.vllm:123: Starting a temporary vLLM server at http://127.0.0.1:43291/v1
INFO 2025-05-16 19:16:01,824 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 1/120
INFO 2025-05-16 19:16:05,270 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 2/120
INFO 2025-05-16 19:16:08,697 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 3/120
INFO 2025-05-16 19:16:11,971 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 4/120
INFO 2025-05-16 19:16:15,208 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 5/120
INFO 2025-05-16 19:16:18,399 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 6/120
INFO 2025-05-16 19:16:21,761 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 7/120
INFO 2025-05-16 19:16:25,080 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 8/120
INFO 2025-05-16 19:16:28,394 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 9/120
INFO 2025-05-16 19:16:31,679 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 10/120
INFO 2025-05-16 19:16:34,973 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 11/120
INFO 2025-05-16 19:16:38,263 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 12/120
INFO 2025-05-16 19:16:41,522 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 13/120
INFO 2025-05-16 19:16:44,800 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 14/120
INFO 2025-05-16 19:16:48,217 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 15/120
INFO 2025-05-16 19:16:51,575 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 16/120
INFO 2025-05-16 19:16:54,745 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 17/120
INFO 2025-05-16 19:16:58,157 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 18/120
INFO 2025-05-16 19:17:01,537 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 19/120
INFO 2025-05-16 19:17:04,732 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 20/120
INFO 2025-05-16 19:17:08,059 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 21/120
INFO 2025-05-16 19:17:11,433 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 22/120
INFO 2025-05-16 19:17:14,741 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 23/120
INFO 2025-05-16 19:17:17,997 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 24/120
INFO 2025-05-16 19:17:21,332 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 25/120
INFO 2025-05-16 19:17:24,678 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 26/120
INFO 2025-05-16 19:17:28,012 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 27/120
INFO 2025-05-16 19:17:31,350 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 28/120
INFO 2025-05-16 19:17:34,680 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 29/120
INFO 2025-05-16 19:17:38,071 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 30/120
INFO 2025-05-16 19:17:41,458 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 31/120
INFO 2025-05-16 19:17:44,721 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 32/120
INFO 2025-05-16 19:17:47,980 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 33/120
INFO 2025-05-16 19:17:51,237 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 34/120
INFO 2025-05-16 19:17:54,458 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 35/120
INFO 2025-05-16 19:17:57,761 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 36/120
INFO 2025-05-16 19:18:01,081 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 37/120
INFO 2025-05-16 19:18:04,305 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 38/120
INFO 2025-05-16 19:18:07,581 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 39/120
INFO 2025-05-16 19:18:10,859 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 40/120
INFO 2025-05-16 19:18:14,094 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 41/120
INFO 2025-05-16 19:18:17,342 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 42/120
INFO 2025-05-16 19:18:20,586 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 43/120
INFO 2025-05-16 19:18:24,011 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 44/120
INFO 2025-05-16 19:18:27,233 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 45/120
INFO 2025-05-16 19:18:30,637 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 46/120
INFO 2025-05-16 19:18:33,980 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 47/120
INFO 2025-05-16 19:18:37,414 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43291/v1, this might take a moment... Attempt: 48/120
INFO 2025-05-16 19:18:37,418 instructlab.model.backends.vllm:145: vLLM engine successfully started at http://127.0.0.1:43291/v1
INFO 2025-05-16 19:18:37,629 numexpr.utils:146: Note: detected 208 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
INFO 2025-05-16 19:18:37,629 numexpr.utils:149: Note: NumExpr detected 208 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
INFO 2025-05-16 19:18:37,629 numexpr.utils:162: NumExpr defaulting to 16 threads.
INFO 2025-05-16 19:18:37,904 datasets:54: PyTorch version 2.6.0 available.
INFO 2025-05-16 19:18:39,276 instructlab:206: Generating synthetic data using '/usr/share/instructlab/sdg/pipelines/agentic' pipeline, '/var/home/cloud-user/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1' model, '/var/home/cloud-user/.local/share/instructlab/taxonomy' taxonomy, against http://127.0.0.1:43291/v1 server
INFO 2025-05-16 19:18:39,276 root:356: Converting taxonomy to samples
INFO 2025-05-16 19:18:39,925 instructlab.sdg.utils.taxonomy:143: Processing files...
INFO 2025-05-16 19:18:39,925 instructlab.sdg.utils.taxonomy:148: Pattern 'swifties.md' matched 1 files.
INFO 2025-05-16 19:18:39,925 instructlab.sdg.utils.taxonomy:152: Processing file: /var/home/cloud-user/.local/share/instructlab/datasets/2025-05-16_191556/preprocessed_2025-05-16T19_18_39/documents/knowledge_arts_music_fandom_swifties_6i0ti5dl/swifties.md
INFO 2025-05-16 19:18:39,925 instructlab.sdg.utils.taxonomy:156: Added file path: /var/home/cloud-user/.local/share/instructlab/datasets/2025-05-16_191556/preprocessed_2025-05-16T19_18_39/documents/knowledge_arts_music_fandom_swifties_6i0ti5dl/swifties.md
INFO 2025-05-16 19:18:40,265 instructlab.sdg.utils.taxonomy:143: Processing files...
INFO 2025-05-16 19:18:40,265 instructlab.sdg.utils.taxonomy:148: Pattern 'chickadee.md' matched 1 files.
INFO 2025-05-16 19:18:40,265 instructlab.sdg.utils.taxonomy:152: Processing file: /var/home/cloud-user/.local/share/instructlab/datasets/2025-05-16_191556/preprocessed_2025-05-16T19_18_39/documents/knowledge_science_animals_birds_black_capped_chickadee_i_fb7ocd/chickadee.md
INFO 2025-05-16 19:18:40,265 instructlab.sdg.utils.taxonomy:156: Added file path: /var/home/cloud-user/.local/share/instructlab/datasets/2025-05-16_191556/preprocessed_2025-05-16T19_18_39/documents/knowledge_science_animals_birds_black_capped_chickadee_i_fb7ocd/chickadee.md
INFO 2025-05-16 19:19:03,153 instructlab.sdg.utils.chunkers:144: Found the docling models
INFO 2025-05-16 19:19:03,795 instructlab.sdg.utils.chunkers:249: Successfully loaded tokenizer from: /var/home/cloud-user/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1
INFO 2025-05-16 19:19:04,030 docling.document_converter:269: Going to convert document batch...
INFO 2025-05-16 19:19:04,030 docling.document_converter:304: Initializing pipeline for SimplePipeline with options hash 4cc01982ae99b46a2a63fcda46c47c35
INFO 2025-05-16 19:19:04,030 docling.pipeline.base_pipeline:39: Processing document swifties.md
INFO 2025-05-16 19:19:04,487 docling.document_converter:284: Finished converting document swifties.md in 0.46 sec.
INFO 2025-05-16 19:19:04,710 instructlab.sdg.utils.chunkers:144: Found the docling models
INFO 2025-05-16 19:19:05,083 instructlab.sdg.utils.chunkers:249: Successfully loaded tokenizer from: /var/home/cloud-user/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1
INFO 2025-05-16 19:19:05,084 docling.document_converter:269: Going to convert document batch...
INFO 2025-05-16 19:19:05,084 docling.document_converter:304: Initializing pipeline for SimplePipeline with options hash 4cc01982ae99b46a2a63fcda46c47c35
INFO 2025-05-16 19:19:05,084 docling.pipeline.base_pipeline:39: Processing document chickadee.md
INFO 2025-05-16 19:19:06,270 docling.document_converter:284: Finished converting document chickadee.md in 1.19 sec.
INFO 2025-05-16 19:19:06,371 instructlab.sdg.generate_data:405: Taxonomy converted to samples and written to /var/home/cloud-user/.local/share/instructlab/datasets/2025-05-16_191556/preprocessed_2025-05-16T19_18_39
INFO 2025-05-16 19:19:06,405 instructlab.sdg.generate_data:441: Synthesizing new instructions. If you aren't satisfied with the generated instructions, interrupt training (Ctrl-C) and try adjusting your YAML files. Adding more examples may help.
INFO 2025-05-16 19:19:06,482 instructlab.sdg.checkpointing:59: No existing checkpoints found in /var/home/cloud-user/.local/share/instructlab/datasets/checkpoints/compositional_skills_grounded_linguistics_inclusion, generating from scratch
INFO 2025-05-16 19:19:06,482 instructlab.sdg.pipeline:161: Running pipeline with multi-threaded batching. Using 2 workers for batches of size 256
INFO 2025-05-16 19:19:08,909 instructlab.sdg.blocks.llmblock:56: LLM server supports batched inputs: True
INFO 2025-05-16 19:19:08,909 instructlab.sdg.pipeline:199: Running block: gen_contexts
INFO 2025-05-16 19:19:25,197 instructlab.sdg.pipeline:199: Running block: gen_grounded_questions
INFO 2025-05-16 19:19:36,955 instructlab.sdg.pipeline:199: Running block: eval_grounded_questions
INFO 2025-05-16 19:19:53,059 instructlab.sdg.pipeline:199: Running block: filter_grounded_questions
Map (num_proc=8): 100%|##########| 146/146 [00:00<00:00, 520.36 examples/s]
Filter (num_proc=8): 100%|##########| 146/146 [00:00<00:00, 749.85 examples/s] 
INFO 2025-05-16 19:19:54,052 instructlab.sdg.pipeline:199: Running block: gen_grounded_responses
INFO 2025-05-16 19:20:15,929 instructlab.sdg.pipeline:199: Running block: evaluate_grounded_qa_pair
INFO 2025-05-16 19:20:30,681 instructlab.sdg.pipeline:199: Running block: filter_grounded_qa_pair
Map (num_proc=8): 100%|##########| 131/131 [00:00<00:00, 466.32 examples/s]
Filter (num_proc=8): 100%|##########| 131/131 [00:00<00:00, 658.65 examples/s]
INFO 2025-05-16 19:20:31,676 instructlab.sdg.pipeline:199: Running block: combine_question_and_context
Map (num_proc=8): 100%|##########| 129/129 [00:00<00:00, 409.10 examples/s]
INFO 2025-05-16 19:20:32,251 instructlab.sdg.pipeline:199: Running block: router
INFO 2025-05-16 19:20:36,689 instructlab.sdg.pipeline:199: Running block: icl_populator
Map (num_proc=8): 100%|##########| 129/129 [00:00<00:00, 376.77 examples/s]
INFO 2025-05-16 19:20:37,314 instructlab.sdg.pipeline:199: Running block: analyzer
INFO 2025-05-16 19:21:00,535 instructlab.sdg.pipeline:199: Running block: critic
INFO 2025-05-16 19:21:39,476 instructlab.sdg.pipeline:199: Running block: planner
INFO 2025-05-16 19:22:16,190 instructlab.sdg.pipeline:199: Running block: revised_responder
INFO 2025-05-16 19:23:11,493 instructlab.sdg.pipeline:199: Running block: judge
INFO 2025-05-16 19:23:32,345 instructlab.sdg.pipeline:199: Running block: filter_judgement
Map (num_proc=8): 100%|##########| 126/126 [00:00<00:00, 338.43 examples/s]
Filter (num_proc=8): 100%|##########| 126/126 [00:00<00:00, 611.53 examples/s]
INFO 2025-05-16 19:23:33,448 instructlab.sdg.pipeline:199: Running block: response_selector
Map (num_proc=8): 100%|##########| 125/125 [00:00<00:00, 278.60 examples/s]
INFO 2025-05-16 19:23:34,156 instructlab.sdg.checkpointing:44: Saving checkpoint to /var/home/cloud-user/.local/share/instructlab/datasets/checkpoints/compositional_skills_grounded_linguistics_inclusion/data_checkpoint_333cb0d4232b49bf93e83fab783b2c4c.jsonl
Creating json from Arrow format: 100%|##########| 1/1 [00:00<00:00, 83.65ba/s]
INFO 2025-05-16 19:23:34,214 instructlab.sdg.generate_data:478: Generated 125 samples
INFO 2025-05-16 19:23:34,271 instructlab.sdg.checkpointing:59: No existing checkpoints found in /var/home/cloud-user/.local/share/instructlab/datasets/checkpoints/compositional_skills_grounded_linguistics_writing_rewriting, generating from scratch
INFO 2025-05-16 19:23:34,272 instructlab.sdg.pipeline:161: Running pipeline with multi-threaded batching. Using 2 workers for batches of size 256
INFO 2025-05-16 19:23:34,275 instructlab.sdg.pipeline:199: Running block: gen_contexts
INFO 2025-05-16 19:23:39,161 instructlab.sdg.pipeline:199: Running block: gen_grounded_questions
INFO 2025-05-16 19:23:55,647 instructlab.sdg.pipeline:199: Running block: eval_grounded_questions
INFO 2025-05-16 19:24:11,047 instructlab.sdg.pipeline:199: Running block: filter_grounded_questions
Map (num_proc=8): 100%|##########| 148/148 [00:00<00:00, 541.52 examples/s]
Filter (num_proc=8): 100%|##########| 148/148 [00:00<00:00, 769.12 examples/s] 
INFO 2025-05-16 19:24:12,028 instructlab.sdg.pipeline:199: Running block: gen_grounded_responses
INFO 2025-05-16 19:24:21,406 instructlab.sdg.pipeline:199: Running block: evaluate_grounded_qa_pair
INFO 2025-05-16 19:24:32,079 instructlab.sdg.pipeline:199: Running block: filter_grounded_qa_pair
Map (num_proc=8): 100%|##########| 93/93 [00:00<00:00, 344.83 examples/s]
Filter (num_proc=8): 100%|##########| 93/93 [00:00<00:00, 477.09 examples/s]
INFO 2025-05-16 19:24:33,078 instructlab.sdg.pipeline:199: Running block: combine_question_and_context
Map (num_proc=8): 100%|##########| 92/92 [00:00<00:00, 278.24 examples/s]
INFO 2025-05-16 19:24:33,677 instructlab.sdg.pipeline:199: Running block: router
INFO 2025-05-16 19:24:36,573 instructlab.sdg.pipeline:199: Running block: icl_populator
Map (num_proc=8): 100%|##########| 92/92 [00:00<00:00, 277.63 examples/s]
INFO 2025-05-16 19:24:37,171 instructlab.sdg.pipeline:199: Running block: analyzer
INFO 2025-05-16 19:24:57,519 instructlab.sdg.pipeline:199: Running block: critic
INFO 2025-05-16 19:25:27,351 instructlab.sdg.pipeline:199: Running block: planner
INFO 2025-05-16 19:25:54,063 instructlab.sdg.pipeline:199: Running block: revised_responder
INFO 2025-05-16 19:26:33,373 instructlab.sdg.pipeline:199: Running block: judge
INFO 2025-05-16 19:26:48,433 instructlab.sdg.pipeline:199: Running block: filter_judgement
Map (num_proc=8): 100%|##########| 85/85 [00:00<00:00, 252.70 examples/s]
Filter (num_proc=8): 100%|##########| 85/85 [00:00<00:00, 413.49 examples/s]
INFO 2025-05-16 19:26:49,500 instructlab.sdg.pipeline:199: Running block: response_selector
Map (num_proc=8): 100%|##########| 85/85 [00:00<00:00, 124.84 examples/s]
INFO 2025-05-16 19:26:50,449 instructlab.sdg.checkpointing:44: Saving checkpoint to /var/home/cloud-user/.local/share/instructlab/datasets/checkpoints/compositional_skills_grounded_linguistics_writing_rewriting/data_checkpoint_a8737053f6bb42aaa7b175c56786ab84.jsonl
Creating json from Arrow format: 100%|##########| 1/1 [00:00<00:00, 155.59ba/s]
INFO 2025-05-16 19:26:50,480 instructlab.sdg.generate_data:478: Generated 85 samples
INFO 2025-05-16 19:26:50,529 instructlab.sdg.checkpointing:59: No existing checkpoints found in /var/home/cloud-user/.local/share/instructlab/datasets/checkpoints/compositional_skills_linguistics_synonyms, generating from scratch
INFO 2025-05-16 19:26:50,529 instructlab.sdg.pipeline:161: Running pipeline with multi-threaded batching. Using 2 workers for batches of size 256
INFO 2025-05-16 19:26:50,555 instructlab.sdg.pipeline:199: Running block: gen_questions
INFO 2025-05-16 19:27:16,305 instructlab.sdg.pipeline:199: Running block: eval_questions
INFO 2025-05-16 19:27:29,310 instructlab.sdg.pipeline:199: Running block: filter_questions
Map (num_proc=8): 100%|##########| 166/166 [00:00<00:00, 655.89 examples/s]
Filter (num_proc=8): 100%|##########| 166/166 [00:00<00:00, 804.53 examples/s]
INFO 2025-05-16 19:27:30,318 instructlab.sdg.pipeline:199: Running block: gen_responses
INFO 2025-05-16 19:27:36,520 instructlab.sdg.pipeline:199: Running block: evaluate_qa_pair
INFO 2025-05-16 19:27:49,544 instructlab.sdg.pipeline:199: Running block: filter_qa_pair
Map (num_proc=8): 100%|##########| 87/87 [00:00<00:00, 346.30 examples/s]
Filter (num_proc=8): 100%|##########| 87/87 [00:00<00:00, 441.30 examples/s]
INFO 2025-05-16 19:27:50,532 instructlab.sdg.pipeline:199: Running block: router
INFO 2025-05-16 19:27:52,669 instructlab.sdg.pipeline:199: Running block: icl_populator
Map (num_proc=8): 100%|##########| 87/87 [00:00<00:00, 300.50 examples/s]
INFO 2025-05-16 19:27:53,240 instructlab.sdg.pipeline:199: Running block: analyzer
INFO 2025-05-16 19:28:10,078 instructlab.sdg.pipeline:199: Running block: critic
INFO 2025-05-16 19:28:31,677 instructlab.sdg.pipeline:199: Running block: planner
INFO 2025-05-16 19:28:51,056 instructlab.sdg.pipeline:199: Running block: revised_responder
INFO 2025-05-16 19:29:12,761 instructlab.sdg.pipeline:199: Running block: judge
INFO 2025-05-16 19:29:20,715 instructlab.sdg.pipeline:199: Running block: filter_judgement
Map (num_proc=8): 100%|##########| 86/86 [00:00<00:00, 284.89 examples/s]
Filter (num_proc=8): 100%|##########| 86/86 [00:00<00:00, 425.33 examples/s]
INFO 2025-05-16 19:29:21,770 instructlab.sdg.pipeline:199: Running block: response_selector
Map (num_proc=8): 100%|##########| 70/70 [00:00<00:00, 186.20 examples/s]
INFO 2025-05-16 19:29:22,415 instructlab.sdg.checkpointing:44: Saving checkpoint to /var/home/cloud-user/.local/share/instructlab/datasets/checkpoints/compositional_skills_linguistics_synonyms/data_checkpoint_5b0cb73db587436bbd71c412953aab01.jsonl
Creating json from Arrow format: 100%|##########| 1/1 [00:00<00:00, 248.29ba/s]
INFO 2025-05-16 19:29:22,440 instructlab.sdg.generate_data:478: Generated 70 samples
INFO 2025-05-16 19:29:22,511 instructlab.sdg.checkpointing:59: No existing checkpoints found in /var/home/cloud-user/.local/share/instructlab/datasets/checkpoints/knowledge_arts_music_fandom_swifties, generating from scratch
INFO 2025-05-16 19:29:22,511 instructlab.sdg.pipeline:161: Running pipeline with multi-threaded batching. Using 2 workers for batches of size 256
INFO 2025-05-16 19:29:22,517 instructlab.sdg.pipeline:199: Running block: router
INFO 2025-05-16 19:29:31,565 instructlab.sdg.pipeline:199: Running block: SetClassifierValue
INFO 2025-05-16 19:29:31,578 instructlab.sdg.pipeline:199: Running block: duplicate_document_col
INFO 2025-05-16 19:29:31,586 instructlab.sdg.pipeline:199: Running block: gen_detailed_summary
INFO 2025-05-16 19:29:55,561 instructlab.sdg.pipeline:199: Running block: gen_atomic_facts
INFO 2025-05-16 19:30:24,869 instructlab.sdg.pipeline:199: Running block: gen_extractive_summary
INFO 2025-05-16 19:30:45,438 instructlab.sdg.pipeline:199: Running block: flatten_summary_columns
INFO 2025-05-16 19:30:45,461 instructlab.sdg.pipeline:199: Running block: rename_to_document_column
INFO 2025-05-16 19:30:45,479 instructlab.sdg.pipeline:199: Running block: knowledge generation

INFO 2025-05-16 19:36:38,487 instructlab.sdg.pipeline:199: Running block: eval_faithfulness_qa_pair
INFO 2025-05-16 19:46:28,548 instructlab.sdg.pipeline:199: Running block: filter_faithfulness
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 619.34 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1236.97 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 670.76 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1173.34 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 686.15 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1167.07 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 700.68 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1212.04 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 678.15 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1208.76 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 704.69 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1214.09 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 698.65 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1234.27 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 706.01 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1210.50 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 705.44 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1214.91 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 695.81 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1187.78 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 704.81 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1208.26 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 697.30 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1206.63 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 687.03 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1205.25 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 693.10 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1147.54 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 693.32 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1228.68 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 701.24 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1199.00 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 686.39 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1213.85 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 678.49 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1124.85 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 692.98 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1226.52 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 484.58 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1173.88 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 626.32 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1158.54 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 665.07 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1195.14 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 690.53 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1204.36 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 673.92 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1177.80 examples/s]
Map (num_proc=8): 100%|##########| 77/77 [00:00<00:00, 228.03 examples/s]
Filter (num_proc=8): 100%|##########| 77/77 [00:00<00:00, 379.96 examples/s]
INFO 2025-05-16 19:46:57,013 instructlab.sdg.pipeline:199: Running block: eval_relevancy_qa_pair
INFO 2025-05-16 19:51:22,880 instructlab.sdg.pipeline:199: Running block: filter_relevancy
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 604.71 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1144.04 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 655.60 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1165.43 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 665.63 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1169.99 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 671.04 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1180.55 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 658.82 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1189.14 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 659.68 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1179.76 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 657.11 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1178.01 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 666.80 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1126.15 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 657.57 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1200.91 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 634.41 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1159.86 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 644.60 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1183.69 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 668.34 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1149.10 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 660.67 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1188.31 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 643.34 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1140.60 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 647.68 examples/s] 
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1186.11 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 679.00 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1180.85 examples/s]
Map (num_proc=8): 100%|##########| 109/109 [00:00<00:00, 307.69 examples/s]
Filter (num_proc=8): 100%|##########| 109/109 [00:00<00:00, 505.78 examples/s]
INFO 2025-05-16 19:51:42,629 instructlab.sdg.pipeline:199: Running block: eval_verify_question
INFO 2025-05-16 19:55:43,053 instructlab.sdg.pipeline:199: Running block: filter_verify_question
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 602.57 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1135.88 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 613.06 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1194.64 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 650.10 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1191.31 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 646.85 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1206.91 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 634.24 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1178.42 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 663.59 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1114.49 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 657.74 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1185.12 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 667.39 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1185.46 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 653.08 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1132.46 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 659.45 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1155.09 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 632.59 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1177.87 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 672.62 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1160.11 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 672.56 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1179.31 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 638.16 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1169.38 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 674.92 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1144.32 examples/s]
Map (num_proc=8): 100%|##########| 118/118 [00:00<00:00, 342.64 examples/s]
Filter (num_proc=8): 100%|##########| 118/118 [00:00<00:00, 550.23 examples/s]
INFO 2025-05-16 19:56:01,701 instructlab.sdg.checkpointing:44: Saving checkpoint to /var/home/cloud-user/.local/share/instructlab/datasets/checkpoints/knowledge_arts_music_fandom_swifties/data_checkpoint_569787e50bdd4b40bca2fe0c760b0576.jsonl
Creating json from Arrow format: 100%|##########| 4/4 [00:00<00:00, 34.69ba/s]
INFO 2025-05-16 19:56:02,469 instructlab.sdg.generate_data:478: Generated 3152 samples
INFO 2025-05-16 19:56:02,599 instructlab.sdg.checkpointing:59: No existing checkpoints found in /var/home/cloud-user/.local/share/instructlab/datasets/checkpoints/knowledge_science_animals_birds_black_capped_chickadee, generating from scratch
INFO 2025-05-16 19:56:02,599 instructlab.sdg.pipeline:161: Running pipeline with multi-threaded batching. Using 2 workers for batches of size 256
INFO 2025-05-16 19:56:02,603 instructlab.sdg.pipeline:199: Running block: router
INFO 2025-05-16 19:56:09,673 instructlab.sdg.pipeline:199: Running block: SetClassifierValue
INFO 2025-05-16 19:56:09,686 instructlab.sdg.pipeline:199: Running block: duplicate_document_col
INFO 2025-05-16 19:56:09,693 instructlab.sdg.pipeline:199: Running block: gen_detailed_summary
INFO 2025-05-16 19:56:31,568 instructlab.sdg.pipeline:199: Running block: gen_atomic_facts
INFO 2025-05-16 19:57:05,806 instructlab.sdg.pipeline:199: Running block: gen_extractive_summary
INFO 2025-05-16 19:57:24,369 instructlab.sdg.pipeline:199: Running block: flatten_summary_columns
INFO 2025-05-16 19:57:24,393 instructlab.sdg.pipeline:199: Running block: rename_to_document_column
INFO 2025-05-16 19:57:24,408 instructlab.sdg.pipeline:199: Running block: knowledge generation
INFO 2025-05-16 20:02:33,913 instructlab.sdg.pipeline:199: Running block: eval_faithfulness_qa_pair
INFO 2025-05-16 20:10:30,237 instructlab.sdg.pipeline:199: Running block: filter_faithfulness
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 625.70 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1145.91 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 655.13 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1124.97 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 674.51 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1163.70 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 702.99 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1189.53 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 668.99 examples/s] 
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1195.65 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 704.86 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1208.44 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 698.31 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1182.92 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 687.35 examples/s] 
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1191.89 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 665.53 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1180.13 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 684.86 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1121.63 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 709.35 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1208.25 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 698.77 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1196.72 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 662.20 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1205.26 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 685.70 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1133.74 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 689.07 examples/s] 
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1196.70 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 668.85 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1182.32 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 492.75 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1187.13 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 682.95 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1165.96 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 663.35 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1157.00 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 688.95 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1173.63 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 677.59 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1138.96 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 709.14 examples/s] 
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1190.34 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 685.66 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1193.27 examples/s]
Map (num_proc=8): 100%|##########| 139/139 [00:00<00:00, 417.87 examples/s]
Filter (num_proc=8): 100%|##########| 139/139 [00:00<00:00, 641.32 examples/s]
INFO 2025-05-16 20:10:57,972 instructlab.sdg.pipeline:199: Running block: eval_relevancy_qa_pair
INFO 2025-05-16 20:13:46,662 instructlab.sdg.pipeline:199: Running block: filter_relevancy
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 608.75 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1140.70 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 639.88 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1176.26 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 661.77 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1143.30 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 641.74 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1154.39 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 658.81 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1189.28 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 630.69 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1160.30 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 637.19 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1169.34 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 650.06 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1145.93 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 652.78 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1150.12 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 632.26 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1163.68 examples/s]
Map (num_proc=8): 100%|##########| 129/129 [00:00<00:00, 374.15 examples/s]
Filter (num_proc=8): 100%|##########| 129/129 [00:00<00:00, 609.18 examples/s]
INFO 2025-05-16 20:13:59,592 instructlab.sdg.pipeline:199: Running block: eval_verify_question
INFO 2025-05-16 20:16:13,922 instructlab.sdg.pipeline:199: Running block: filter_verify_question
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 613.80 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1137.66 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 646.15 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1142.98 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 617.64 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1155.57 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 634.79 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1111.92 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 659.16 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1188.56 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 627.04 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1155.44 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 628.41 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1143.43 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 648.97 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1109.12 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 659.72 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1161.03 examples/s]
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 598.07 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1178.82 examples/s]
Map (num_proc=8): 100%|##########| 26/26 [00:00<00:00, 81.45 examples/s]
Filter (num_proc=8): 100%|##########| 26/26 [00:00<00:00, 125.97 examples/s]
INFO 2025-05-16 20:16:26,957 instructlab.sdg.checkpointing:44: Saving checkpoint to /var/home/cloud-user/.local/share/instructlab/datasets/checkpoints/knowledge_science_animals_birds_black_capped_chickadee/data_checkpoint_c7d5732b4b9d40c18a2b7912d59456dc.jsonl
Creating json from Arrow format: 100%|##########| 3/3 [00:00<00:00, 30.36ba/s]
INFO 2025-05-16 20:16:27,588 instructlab.sdg.generate_data:478: Generated 2549 samples
INFO 2025-05-16 20:16:27,621 instructlab.sdg.pipeline:161: Running pipeline with multi-threaded batching. Using 2 workers for batches of size 256
INFO 2025-05-16 20:16:27,627 instructlab.sdg.pipeline:199: Running block: gen_mmlu_knowledge
Filter: 100%|##########| 349/349 [00:00<00:00, 37231.97 examples/s]
Filter: 100%|##########| 349/349 [00:00<00:00, 24493.62 examples/s]
Flattening the indices: 100%|##########| 349/349 [00:00<00:00, 42293.26 examples/s]
Map: 100%|##########| 349/349 [00:00<00:00, 10582.79 examples/s]
Map: 100%|##########| 349/349 [00:00<00:00, 10017.81 examples/s]
Map: 100%|##########| 349/349 [00:00<00:00, 9842.28 examples/s]
Filter: 100%|##########| 349/349 [00:00<00:00, 38837.18 examples/s]
Filter: 100%|##########| 349/349 [00:00<00:00, 20181.33 examples/s]
Filter: 100%|##########| 342/342 [00:00<00:00, 20379.79 examples/s]
Flattening the indices: 100%|##########| 342/342 [00:00<00:00, 36359.42 examples/s]
Casting to class labels: 100%|##########| 342/342 [00:00<00:00, 10485.84 examples/s]
INFO 2025-05-16 20:17:08,795 instructlab.sdg.eval_data:126: Saving MMLU Dataset /var/home/cloud-user/.local/share/instructlab/datasets/2025-05-16_191556/node_datasets_2025-05-16T19_18_39/mmlubench_knowledge_arts_music_fandom_swifties.jsonl
Creating json from Arrow format: 100%|##########| 1/1 [00:00<00:00, 109.54ba/s]
INFO 2025-05-16 20:17:08,805 instructlab.sdg.eval_data:130: Saving MMLU Task yaml /var/home/cloud-user/.local/share/instructlab/datasets/2025-05-16_191556/node_datasets_2025-05-16T19_18_39/knowledge_arts_music_fandom_swifties_task.yaml
INFO 2025-05-16 20:17:08,815 instructlab.sdg.pipeline:161: Running pipeline with multi-threaded batching. Using 2 workers for batches of size 256
INFO 2025-05-16 20:17:08,820 instructlab.sdg.pipeline:199: Running block: gen_mmlu_knowledge
Filter: 100%|##########| 391/391 [00:00<00:00, 49042.25 examples/s]
Filter: 100%|##########| 391/391 [00:00<00:00, 26875.11 examples/s]
Flattening the indices: 100%|##########| 391/391 [00:00<00:00, 43408.49 examples/s]
Map: 100%|##########| 391/391 [00:00<00:00, 10653.81 examples/s]
Map: 100%|##########| 391/391 [00:00<00:00, 10167.54 examples/s]
Map: 100%|##########| 391/391 [00:00<00:00, 9944.23 examples/s]
Filter: 100%|##########| 391/391 [00:00<00:00, 39869.03 examples/s]
Filter: 100%|##########| 391/391 [00:00<00:00, 21181.99 examples/s]
Filter: 100%|##########| 382/382 [00:00<00:00, 20829.48 examples/s]
Flattening the indices: 100%|##########| 382/382 [00:00<00:00, 39227.89 examples/s]
Casting to class labels: 100%|##########| 382/382 [00:00<00:00, 10633.22 examples/s]
INFO 2025-05-16 20:17:45,373 instructlab.sdg.eval_data:126: Saving MMLU Dataset /var/home/cloud-user/.local/share/instructlab/datasets/2025-05-16_191556/node_datasets_2025-05-16T19_18_39/mmlubench_knowledge_science_animals_birds_black_capped_chickadee.jsonl
Creating json from Arrow format: 100%|##########| 1/1 [00:00<00:00, 112.36ba/s]
INFO 2025-05-16 20:17:45,382 instructlab.sdg.eval_data:130: Saving MMLU Task yaml /var/home/cloud-user/.local/share/instructlab/datasets/2025-05-16_191556/node_datasets_2025-05-16T19_18_39/knowledge_science_animals_birds_black_capped_chickadee_task.yaml
Map (num_proc=8): 100%|##########| 125/125 [00:00<00:00, 204.69 examples/s]
Creating json from Arrow format: 100%|##########| 1/1 [00:00<00:00, 78.19ba/s]
Map (num_proc=8): 100%|##########| 85/85 [00:00<00:00, 230.10 examples/s]
Creating json from Arrow format: 100%|##########| 1/1 [00:00<00:00, 127.23ba/s]
Map (num_proc=8): 100%|##########| 70/70 [00:00<00:00, 208.59 examples/s]
Creating json from Arrow format: 100%|##########| 1/1 [00:00<00:00, 206.26ba/s]
Map: 100%|##########| 3152/3152 [00:00<00:00, 8362.23 examples/s]
Map: 100%|##########| 3152/3152 [00:00<00:00, 32522.70 examples/s]
Filter: 100%|##########| 3152/3152 [00:00<00:00, 60026.64 examples/s]
Map: 100%|##########| 61/61 [00:00<00:00, 9643.53 examples/s]
Map: 100%|##########| 61/61 [00:00<00:00, 16033.88 examples/s]
Creating json from Arrow format: 100%|##########| 4/4 [00:00<00:00, 45.49ba/s]
Map: 100%|##########| 3152/3152 [00:00<00:00, 8533.71 examples/s]
Map: 100%|##########| 3152/3152 [00:00<00:00, 8501.16 examples/s]
Map: 100%|##########| 3152/3152 [00:00<00:00, 8615.11 examples/s]
Map: 100%|##########| 3152/3152 [00:00<00:00, 9914.32 examples/s]
Filter: 100%|##########| 3152/3152 [00:00<00:00, 58775.04 examples/s]
Map: 100%|##########| 61/61 [00:00<00:00, 9717.52 examples/s]
INFO 2025-05-16 20:20:25,880 instructlab.sdg.datamixing:774: Knowledge detected to be less than 3.00% of skills (1.61%), upsampling to: 11824
Creating json from Arrow format: 100%|##########| 7/7 [00:00<00:00, 23.79ba/s]
Map: 100%|##########| 2549/2549 [00:00<00:00, 8366.46 examples/s]
Map: 100%|##########| 2549/2549 [00:00<00:00, 31561.16 examples/s]
Filter: 100%|##########| 2549/2549 [00:00<00:00, 58059.35 examples/s]
Map: 100%|##########| 61/61 [00:00<00:00, 9781.42 examples/s]
Map: 100%|##########| 61/61 [00:00<00:00, 15997.78 examples/s]
Creating json from Arrow format: 100%|##########| 3/3 [00:00<00:00, 43.33ba/s]
Map: 100%|##########| 2549/2549 [00:00<00:00, 8433.66 examples/s]
Map: 100%|##########| 2549/2549 [00:00<00:00, 8433.84 examples/s]
Map: 100%|##########| 2549/2549 [00:00<00:00, 8388.32 examples/s]
Map: 100%|##########| 2549/2549 [00:00<00:00, 31430.06 examples/s]
Filter: 100%|##########| 2549/2549 [00:00<00:00, 57177.82 examples/s]
Map: 100%|##########| 61/61 [00:00<00:00, 9735.27 examples/s]
INFO 2025-05-16 20:20:27,969 instructlab.sdg.datamixing:774: Knowledge detected to be less than 3.00% of skills (1.31%), upsampling to: 11824
Creating json from Arrow format: 100%|##########| 6/6 [00:00<00:00, 28.94ba/s]
INFO 2025-05-16 20:20:28,961 instructlab.sdg.datamixing:158: Loading dataset from /usr/share/instructlab/sdg/datasets/skills.jsonl ...
Generating train split: 326137 examples [02:14, 2418.87 examples/s] 
INFO 2025-05-16 20:22:47,156 instructlab.model.backends.vllm:512: Waiting for GPU VRAM reclamation...
failed to generate data with exception: An error occurred while generating the dataset

real	67m3.073s
user	0m1.813s
sys	0m1.002s
+ tail -f iso-testrun/ilab-data-generate
Map: 100%|##########| 2549/2549 [00:00<00:00, 8388.32 examples/s]
Map: 100%|##########| 2549/2549 [00:00<00:00, 31430.06 examples/s]
Filter: 100%|##########| 2549/2549 [00:00<00:00, 57177.82 examples/s]
Map: 100%|##########| 61/61 [00:00<00:00, 9735.27 examples/s]
INFO 2025-05-16 20:20:27,969 instructlab.sdg.datamixing:774: Knowledge detected to be less than 3.00% of skills (1.31%), upsampling to: 11824
Creating json from Arrow format: 100%|##########| 6/6 [00:00<00:00, 28.94ba/s]
INFO 2025-05-16 20:20:28,961 instructlab.sdg.datamixing:158: Loading dataset from /usr/share/instructlab/sdg/datasets/skills.jsonl ...
Generating train split: 326137 examples [02:14, 2418.87 examples/s] 
INFO 2025-05-16 20:22:47,156 instructlab.model.backends.vllm:512: Waiting for GPU VRAM reclamation...
failed to generate data with exception: An error occurred while generating the dataset
:q


^C
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ exit
logout
Connection to 169.63.187.52 closed.
mdepaulo@mdepaulo-thinkpadx1nanogen2:~/rhelai/ecosystem-rhel-ai$ ssh mikedep333-ibm-us-east
Last login: Fri May 16 20:43:41 2025 from 98.116.66.226
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ find . | grep config.yaml
./.config/instructlab/config.yaml.lock
./.config/instructlab/config.yaml
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ vim ./.config/instructlab/config.yaml
-bash: vim: command not found
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ vim ./.config/instructlab/config.yaml
-bash: vim: command not found
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ vi ./.config/instructlab/config.yaml
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ vi ./.config/instructlab/config.yaml
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ cat EL_AI_test_1.5.sh 
set -eux
#####
# podman login registry.stage.redhat.io # add credentials
podman login registry.redhat.io # add credentials
ilab --version # to get a rhc connect command!
sudo cp /run/user/1000/containers/auth.json /etc/ostree/ || sudo cp $HOME/.config/containers/auth.json /etc/ostree #to make bootc switch worky
###############
mkdir iso-testrun
ilab config init
#sed -i '/--tensor-parallel-size/,+1d' $HOME/.config/instructlab/config.yaml
#sed -i 's/gpus: 4/gpus: 1/g' $HOME/.config/instructlab/config.yaml
ilab config show > iso-testrun/ilab-config-show
ilab system info > iso-testrun/ilab-system-info
### Pay attention to what models are to be used for testing the speciffic releases, this is valid for 1.4 !!!
### Also, pay attention to the .stage in the url - if you're doing prod testing, it'd be docker://registry.redhat.io
ilab model download --repository docker://registry.redhat.io/rhelai1/skills-adapter-v3 --release 1.5
ilab model download --repository docker://registry.redhat.io/rhelai1/knowledge-adapter-v3 --release 1.5
ilab model download --repository docker://registry.redhat.io/rhelai1/granite-3.1-8b-lab-v2 --release 1.5
ilab model download --repository docker://registry.redhat.io/rhelai1/granite-3.1-8b-starter-v2 --release 1.5
ilab model download --repository docker://registry.redhat.io/rhelai1/mixtral-8x7b-instruct-v0-1 --release 1.5
ilab model download --repository docker://registry.redhat.io/rhelai1/prometheus-8x7b-v2-0 --release 1.5
# END OF MODEL DOWNLOADS
ilab taxonomy diff
ilab model serve # Ctrl + C after gunicorn starts
ilab model chat
time ilab data generate | tee iso-testrun/ilab-data-generate
tail -f iso-testrun/ilab-data-generate # to watch progress and not stress about ssh connection drop
# ocassionally check output of nvidia-smi -l 3
### end of data generation
shuf -n 15000 .local/share/instructlab/datasets/`ls -1 .local/share/instructlab/datasets/ | head -n1`/skills_train_msgs_*.jsonl > .local/share/instructlab/datasets/`ls -1 .local/share/instructlab/datasets/ | head -n1`/skills_train_msgs_reduced.jsonl
tmux a
time ilab model train -y --force-clear-phased-cache --enable-serving-output --strategy lab-multiphase --phased-phase1-data ~/.local/share/instructlab/datasets/`ls -1 ~/.local/share/instructlab/datasets/ | head -n1`/knowledge_train_msgs_*.jsonl --phased-phase2-data ~/.local/share/instructlab/datasets/`ls -1 .local/share/instructlab/datasets/ | head -n1`/skills_train_msgs_reduced.jsonl --phased-phase1-num-epochs 2 --phased-phase2-num-epochs 2 | tee iso-testrun/ilab-train
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ time ilab data generate | tee iso-testrun/ilab-data-generate-8-gpufix
INFO 2025-05-16 20:49:50,984 instructlab.process.process:300: Started subprocess with PID 1. Logs are being written to /var/home/cloud-user/.local/share/instructlab/logs/generation/generation-4fdd6a98-3297-11f0-9aff-0200048919a9.log.
INFO 2025-05-16 20:49:54,865 instructlab.model.backends.vllm:115: Trying to connect to model server at http://127.0.0.1:8000/v1
INFO 2025-05-16 20:49:56,445 instructlab.model.backends.vllm:332: vLLM starting up on pid 5 at http://127.0.0.1:53067/v1
INFO 2025-05-16 20:49:56,445 instructlab.model.backends.vllm:123: Starting a temporary vLLM server at http://127.0.0.1:53067/v1
INFO 2025-05-16 20:49:56,445 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 1/120
INFO 2025-05-16 20:49:59,707 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 2/120
INFO 2025-05-16 20:50:02,942 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 3/120
INFO 2025-05-16 20:50:06,315 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 4/120
INFO 2025-05-16 20:50:09,552 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 5/120
INFO 2025-05-16 20:50:12,760 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 6/120
INFO 2025-05-16 20:50:16,084 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 7/120
INFO 2025-05-16 20:50:19,404 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 8/120
INFO 2025-05-16 20:50:22,841 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 9/120
INFO 2025-05-16 20:50:26,281 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 10/120
INFO 2025-05-16 20:50:29,496 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 11/120
INFO 2025-05-16 20:50:32,784 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 12/120
INFO 2025-05-16 20:50:36,143 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 13/120
INFO 2025-05-16 20:50:39,377 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 14/120
INFO 2025-05-16 20:50:42,815 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 15/120
INFO 2025-05-16 20:50:46,171 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 16/120
INFO 2025-05-16 20:50:49,549 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 17/120
INFO 2025-05-16 20:50:52,746 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 18/120
INFO 2025-05-16 20:50:56,001 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 19/120
INFO 2025-05-16 20:50:59,293 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 20/120
INFO 2025-05-16 20:51:02,602 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 21/120
INFO 2025-05-16 20:51:05,830 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 22/120
INFO 2025-05-16 20:51:09,170 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 23/120
INFO 2025-05-16 20:51:12,391 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 24/120
INFO 2025-05-16 20:51:15,592 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 25/120
INFO 2025-05-16 20:51:18,928 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 26/120
INFO 2025-05-16 20:51:22,250 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 27/120
INFO 2025-05-16 20:51:25,576 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 28/120
INFO 2025-05-16 20:51:28,872 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 29/120
INFO 2025-05-16 20:51:32,161 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 30/120
INFO 2025-05-16 20:51:35,346 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 31/120
INFO 2025-05-16 20:51:38,646 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 32/120
INFO 2025-05-16 20:51:41,927 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 33/120
INFO 2025-05-16 20:51:45,180 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 34/120
INFO 2025-05-16 20:51:48,445 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 35/120
INFO 2025-05-16 20:51:51,614 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 36/120
INFO 2025-05-16 20:51:54,819 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 37/120
INFO 2025-05-16 20:51:58,239 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 38/120
INFO 2025-05-16 20:52:01,605 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 39/120
INFO 2025-05-16 20:52:04,919 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 40/120
INFO 2025-05-16 20:52:08,326 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 41/120
INFO 2025-05-16 20:52:11,623 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 42/120
INFO 2025-05-16 20:52:14,871 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 43/120
INFO 2025-05-16 20:52:18,279 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 44/120
INFO 2025-05-16 20:52:21,472 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 45/120
INFO 2025-05-16 20:52:24,826 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 46/120
INFO 2025-05-16 20:52:28,048 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 47/120
INFO 2025-05-16 20:52:31,390 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 48/120
INFO 2025-05-16 20:52:34,621 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 49/120
INFO 2025-05-16 20:52:38,005 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 50/120
INFO 2025-05-16 20:52:41,345 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 51/120
INFO 2025-05-16 20:52:44,782 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 52/120
INFO 2025-05-16 20:52:48,036 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 53/120
INFO 2025-05-16 20:52:51,337 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 54/120
INFO 2025-05-16 20:52:54,743 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 55/120
INFO 2025-05-16 20:52:57,994 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 56/120
INFO 2025-05-16 20:53:01,166 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 57/120
INFO 2025-05-16 20:53:04,360 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 58/120
INFO 2025-05-16 20:53:07,539 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 59/120
INFO 2025-05-16 20:53:10,767 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 60/120
INFO 2025-05-16 20:53:14,181 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 61/120
INFO 2025-05-16 20:53:17,362 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 62/120
INFO 2025-05-16 20:53:20,673 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 63/120
INFO 2025-05-16 20:53:24,028 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 64/120
INFO 2025-05-16 20:53:27,212 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 65/120
INFO 2025-05-16 20:53:30,572 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 66/120
INFO 2025-05-16 20:53:33,766 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 67/120
INFO 2025-05-16 20:53:37,165 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 68/120
INFO 2025-05-16 20:53:40,363 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 69/120
INFO 2025-05-16 20:53:43,704 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 70/120
INFO 2025-05-16 20:53:47,058 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 71/120
INFO 2025-05-16 20:53:50,333 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 72/120
INFO 2025-05-16 20:53:53,727 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 73/120
INFO 2025-05-16 20:53:56,988 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 74/120
INFO 2025-05-16 20:54:00,217 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 75/120
INFO 2025-05-16 20:54:03,693 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 76/120
INFO 2025-05-16 20:54:06,856 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:53067/v1, this might take a moment... Attempt: 77/120
INFO 2025-05-16 20:54:06,861 instructlab.model.backends.vllm:145: vLLM engine successfully started at http://127.0.0.1:53067/v1
INFO 2025-05-16 20:54:07,040 numexpr.utils:146: Note: detected 208 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
INFO 2025-05-16 20:54:07,040 numexpr.utils:149: Note: NumExpr detected 208 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
INFO 2025-05-16 20:54:07,041 numexpr.utils:162: NumExpr defaulting to 16 threads.
INFO 2025-05-16 20:54:07,290 datasets:54: PyTorch version 2.6.0 available.
INFO 2025-05-16 20:54:08,140 instructlab:206: Generating synthetic data using '/usr/share/instructlab/sdg/pipelines/agentic' pipeline, '/var/home/cloud-user/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1' model, '/var/home/cloud-user/.local/share/instructlab/taxonomy' taxonomy, against http://127.0.0.1:53067/v1 server
INFO 2025-05-16 20:54:08,141 root:356: Converting taxonomy to samples
INFO 2025-05-16 20:54:08,908 instructlab.sdg.utils.taxonomy:143: Processing files...
INFO 2025-05-16 20:54:08,908 instructlab.sdg.utils.taxonomy:148: Pattern 'swifties.md' matched 1 files.
INFO 2025-05-16 20:54:08,908 instructlab.sdg.utils.taxonomy:152: Processing file: /var/home/cloud-user/.local/share/instructlab/datasets/2025-05-16_204950/preprocessed_2025-05-16T20_54_08/documents/knowledge_arts_music_fandom_swifties_ommcpoil/swifties.md
INFO 2025-05-16 20:54:08,908 instructlab.sdg.utils.taxonomy:156: Added file path: /var/home/cloud-user/.local/share/instructlab/datasets/2025-05-16_204950/preprocessed_2025-05-16T20_54_08/documents/knowledge_arts_music_fandom_swifties_ommcpoil/swifties.md
INFO 2025-05-16 20:54:09,245 instructlab.sdg.utils.taxonomy:143: Processing files...
INFO 2025-05-16 20:54:09,245 instructlab.sdg.utils.taxonomy:148: Pattern 'chickadee.md' matched 1 files.
INFO 2025-05-16 20:54:09,245 instructlab.sdg.utils.taxonomy:152: Processing file: /var/home/cloud-user/.local/share/instructlab/datasets/2025-05-16_204950/preprocessed_2025-05-16T20_54_08/documents/knowledge_science_animals_birds_black_capped_chickadee_qz1qhkkd/chickadee.md
INFO 2025-05-16 20:54:09,245 instructlab.sdg.utils.taxonomy:156: Added file path: /var/home/cloud-user/.local/share/instructlab/datasets/2025-05-16_204950/preprocessed_2025-05-16T20_54_08/documents/knowledge_science_animals_birds_black_capped_chickadee_qz1qhkkd/chickadee.md
INFO 2025-05-16 20:54:47,090 instructlab.sdg.utils.chunkers:144: Found the docling models
INFO 2025-05-16 20:54:47,561 instructlab.sdg.utils.chunkers:249: Successfully loaded tokenizer from: /var/home/cloud-user/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1
INFO 2025-05-16 20:54:47,685 docling.document_converter:269: Going to convert document batch...
INFO 2025-05-16 20:54:47,685 docling.document_converter:304: Initializing pipeline for SimplePipeline with options hash 4cc01982ae99b46a2a63fcda46c47c35
INFO 2025-05-16 20:54:47,685 docling.pipeline.base_pipeline:39: Processing document swifties.md
INFO 2025-05-16 20:54:48,196 docling.document_converter:284: Finished converting document swifties.md in 0.51 sec.
INFO 2025-05-16 20:54:48,406 instructlab.sdg.utils.chunkers:144: Found the docling models
INFO 2025-05-16 20:54:48,779 instructlab.sdg.utils.chunkers:249: Successfully loaded tokenizer from: /var/home/cloud-user/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1
INFO 2025-05-16 20:54:48,780 docling.document_converter:269: Going to convert document batch...
INFO 2025-05-16 20:54:48,780 docling.document_converter:304: Initializing pipeline for SimplePipeline with options hash 4cc01982ae99b46a2a63fcda46c47c35
INFO 2025-05-16 20:54:48,780 docling.pipeline.base_pipeline:39: Processing document chickadee.md
INFO 2025-05-16 20:54:50,130 docling.document_converter:284: Finished converting document chickadee.md in 1.35 sec.
INFO 2025-05-16 20:54:50,230 instructlab.sdg.generate_data:405: Taxonomy converted to samples and written to /var/home/cloud-user/.local/share/instructlab/datasets/2025-05-16_204950/preprocessed_2025-05-16T20_54_08
INFO 2025-05-16 20:54:50,254 instructlab.sdg.generate_data:441: Synthesizing new instructions. If you aren't satisfied with the generated instructions, interrupt training (Ctrl-C) and try adjusting your YAML files. Adding more examples may help.
Generating train split: 125 examples [00:00, 5822.12 examples/s]
INFO 2025-05-16 20:54:50,419 instructlab.sdg.checkpointing:64: Loading existing checkpoints from /var/home/cloud-user/.local/share/instructlab/datasets/checkpoints/compositional_skills_grounded_linguistics_inclusion, with 125 rows
INFO 2025-05-16 20:54:50,429 instructlab.sdg.checkpointing:68: Found 1 missing rows in the dataset
INFO 2025-05-16 20:54:50,429 instructlab.sdg.pipeline:161: Running pipeline with multi-threaded batching. Using 2 workers for batches of size 256
INFO 2025-05-16 20:54:53,153 instructlab.sdg.blocks.llmblock:56: LLM server supports batched inputs: True
INFO 2025-05-16 20:54:53,154 instructlab.sdg.pipeline:199: Running block: gen_contexts
INFO 2025-05-16 20:54:58,184 instructlab.sdg.pipeline:199: Running block: gen_grounded_questions
INFO 2025-05-16 20:55:03,105 instructlab.sdg.pipeline:199: Running block: eval_grounded_questions
INFO 2025-05-16 20:55:06,842 instructlab.sdg.pipeline:199: Running block: filter_grounded_questions
Map (num_proc=8): 100%|##########| 30/30 [00:00<00:00, 102.85 examples/s]
Filter (num_proc=8): 100%|##########| 30/30 [00:00<00:00, 137.92 examples/s]
INFO 2025-05-16 20:55:07,920 instructlab.sdg.pipeline:199: Running block: gen_grounded_responses
INFO 2025-05-16 20:55:11,175 instructlab.sdg.pipeline:199: Running block: evaluate_grounded_qa_pair
INFO 2025-05-16 20:55:13,692 instructlab.sdg.pipeline:199: Running block: filter_grounded_qa_pair
Map (num_proc=8): 100%|##########| 30/30 [00:00<00:00, 102.91 examples/s]
Filter (num_proc=8): 100%|##########| 30/30 [00:00<00:00, 136.04 examples/s]
INFO 2025-05-16 20:55:14,756 instructlab.sdg.pipeline:199: Running block: combine_question_and_context
Map (num_proc=8): 100%|##########| 30/30 [00:00<00:00, 96.24 examples/s] 
INFO 2025-05-16 20:55:15,345 instructlab.sdg.pipeline:199: Running block: router
INFO 2025-05-16 20:55:17,508 instructlab.sdg.pipeline:199: Running block: icl_populator
Map (num_proc=8): 100%|##########| 30/30 [00:00<00:00, 89.03 examples/s] 
INFO 2025-05-16 20:55:18,123 instructlab.sdg.pipeline:199: Running block: analyzer
INFO 2025-05-16 20:55:23,936 instructlab.sdg.pipeline:199: Running block: critic
INFO 2025-05-16 20:55:31,450 instructlab.sdg.pipeline:199: Running block: planner
INFO 2025-05-16 20:55:36,856 instructlab.sdg.pipeline:199: Running block: revised_responder
INFO 2025-05-16 20:55:49,813 instructlab.sdg.pipeline:199: Running block: judge
INFO 2025-05-16 20:55:56,246 instructlab.sdg.pipeline:199: Running block: filter_judgement
Map (num_proc=8): 100%|##########| 30/30 [00:00<00:00, 87.22 examples/s] 
Filter (num_proc=8): 100%|##########| 30/30 [00:00<00:00, 135.65 examples/s]
INFO 2025-05-16 20:55:57,389 instructlab.sdg.pipeline:199: Running block: response_selector
Map (num_proc=8): 100%|##########| 30/30 [00:00<00:00, 44.65 examples/s]
INFO 2025-05-16 20:55:58,347 instructlab.sdg.checkpointing:44: Saving checkpoint to /var/home/cloud-user/.local/share/instructlab/datasets/checkpoints/compositional_skills_grounded_linguistics_inclusion/data_checkpoint_6c34d91841104e5aabf1424098c097ae.jsonl
Creating json from Arrow format: 100%|##########| 1/1 [00:00<00:00, 257.22ba/s]
INFO 2025-05-16 20:55:58,394 instructlab.sdg.generate_data:478: Generated 155 samples
Generating train split: 85 examples [00:00, 19786.65 examples/s]
INFO 2025-05-16 20:55:58,527 instructlab.sdg.checkpointing:64: Loading existing checkpoints from /var/home/cloud-user/.local/share/instructlab/datasets/checkpoints/compositional_skills_grounded_linguistics_writing_rewriting, with 85 rows
INFO 2025-05-16 20:55:58,534 instructlab.sdg.checkpointing:68: Found 0 missing rows in the dataset
INFO 2025-05-16 20:55:58,534 instructlab.sdg.pipeline:161: Running pipeline with multi-threaded batching. Using 2 workers for batches of size 256
INFO 2025-05-16 20:55:58,554 instructlab.sdg.generate_data:478: Generated 85 samples
Generating train split: 70 examples [00:00, 26452.95 examples/s]
INFO 2025-05-16 20:55:58,606 instructlab.sdg.checkpointing:64: Loading existing checkpoints from /var/home/cloud-user/.local/share/instructlab/datasets/checkpoints/compositional_skills_linguistics_synonyms, with 70 rows
INFO 2025-05-16 20:55:58,612 instructlab.sdg.checkpointing:68: Found 0 missing rows in the dataset
INFO 2025-05-16 20:55:58,612 instructlab.sdg.pipeline:161: Running pipeline with multi-threaded batching. Using 2 workers for batches of size 256
INFO 2025-05-16 20:55:58,625 instructlab.sdg.generate_data:478: Generated 70 samples
Generating train split: 3152 examples [00:00, 84409.35 examples/s]
INFO 2025-05-16 20:55:58,707 instructlab.sdg.checkpointing:64: Loading existing checkpoints from /var/home/cloud-user/.local/share/instructlab/datasets/checkpoints/knowledge_arts_music_fandom_swifties, with 3152 rows
INFO 2025-05-16 20:55:58,732 instructlab.sdg.checkpointing:68: Found 6 missing rows in the dataset
INFO 2025-05-16 20:55:58,732 instructlab.sdg.pipeline:161: Running pipeline with multi-threaded batching. Using 2 workers for batches of size 256
INFO 2025-05-16 20:55:58,735 instructlab.sdg.pipeline:199: Running block: router
INFO 2025-05-16 20:56:01,134 instructlab.sdg.pipeline:199: Running block: SetClassifierValue
INFO 2025-05-16 20:56:01,146 instructlab.sdg.pipeline:199: Running block: duplicate_document_col
INFO 2025-05-16 20:56:01,152 instructlab.sdg.pipeline:199: Running block: gen_detailed_summary
INFO 2025-05-16 20:56:07,456 instructlab.sdg.pipeline:199: Running block: gen_atomic_facts
INFO 2025-05-16 20:56:15,243 instructlab.sdg.pipeline:199: Running block: gen_extractive_summary
INFO 2025-05-16 20:56:19,491 instructlab.sdg.pipeline:199: Running block: flatten_summary_columns
INFO 2025-05-16 20:56:19,508 instructlab.sdg.pipeline:199: Running block: rename_to_document_column
INFO 2025-05-16 20:56:19,522 instructlab.sdg.pipeline:199: Running block: knowledge generation
INFO 2025-05-16 20:56:53,148 instructlab.sdg.pipeline:199: Running block: eval_faithfulness_qa_pair
INFO 2025-05-16 20:57:06,798 instructlab.sdg.pipeline:199: Running block: filter_faithfulness
Map (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 616.72 examples/s]
Filter (num_proc=8): 100%|##########| 256/256 [00:00<00:00, 1020.65 examples/s]
Map (num_proc=8): 100%|##########| 28/28 [00:00<00:00, 84.27 examples/s]
Filter (num_proc=8): 100%|##########| 28/28 [00:00<00:00, 123.10 examples/s]
INFO 2025-05-16 20:57:09,195 instructlab.sdg.pipeline:199: Running block: eval_relevancy_qa_pair
INFO 2025-05-16 20:57:14,856 instructlab.sdg.pipeline:199: Running block: filter_relevancy
Map (num_proc=8): 100%|##########| 194/194 [00:00<00:00, 491.74 examples/s]
Filter (num_proc=8): 100%|##########| 194/194 [00:00<00:00, 814.75 examples/s]
INFO 2025-05-16 20:57:16,088 instructlab.sdg.pipeline:199: Running block: eval_verify_question
INFO 2025-05-16 20:57:22,217 instructlab.sdg.pipeline:199: Running block: filter_verify_question
Map (num_proc=8): 100%|##########| 173/173 [00:00<00:00, 446.50 examples/s]
Filter (num_proc=8): 100%|##########| 173/173 [00:00<00:00, 734.86 examples/s] 
INFO 2025-05-16 20:57:23,447 instructlab.sdg.checkpointing:44: Saving checkpoint to /var/home/cloud-user/.local/share/instructlab/datasets/checkpoints/knowledge_arts_music_fandom_swifties/data_checkpoint_9fb7728cd03d4ab0a5eb35a11f932c81.jsonl
Creating json from Arrow format: 100%|##########| 1/1 [00:00<00:00, 127.00ba/s]
INFO 2025-05-16 20:57:24,083 instructlab.sdg.generate_data:478: Generated 3276 samples
Generating train split: 2549 examples [00:00, 27916.54 examples/s]
INFO 2025-05-16 20:57:24,244 instructlab.sdg.checkpointing:64: Loading existing checkpoints from /var/home/cloud-user/.local/share/instructlab/datasets/checkpoints/knowledge_science_animals_birds_black_capped_chickadee, with 2549 rows
INFO 2025-05-16 20:57:24,267 instructlab.sdg.checkpointing:68: Found 4 missing rows in the dataset
INFO 2025-05-16 20:57:24,267 instructlab.sdg.pipeline:161: Running pipeline with multi-threaded batching. Using 2 workers for batches of size 256
INFO 2025-05-16 20:57:24,270 instructlab.sdg.pipeline:199: Running block: router
INFO 2025-05-16 20:57:26,144 instructlab.sdg.pipeline:199: Running block: SetClassifierValue
INFO 2025-05-16 20:57:26,156 instructlab.sdg.pipeline:199: Running block: duplicate_document_col
INFO 2025-05-16 20:57:26,163 instructlab.sdg.pipeline:199: Running block: gen_detailed_summary
INFO 2025-05-16 20:57:29,349 instructlab.sdg.pipeline:199: Running block: gen_atomic_facts
INFO 2025-05-16 20:57:36,635 instructlab.sdg.pipeline:199: Running block: gen_extractive_summary
INFO 2025-05-16 20:57:38,975 instructlab.sdg.pipeline:199: Running block: flatten_summary_columns
INFO 2025-05-16 20:57:38,991 instructlab.sdg.pipeline:199: Running block: rename_to_document_column
INFO 2025-05-16 20:57:39,004 instructlab.sdg.pipeline:199: Running block: knowledge generation
INFO 2025-05-16 20:58:07,371 instructlab.sdg.pipeline:199: Running block: eval_faithfulness_qa_pair
INFO 2025-05-16 20:58:12,957 instructlab.sdg.pipeline:199: Running block: filter_faithfulness
Map (num_proc=8): 100%|##########| 130/130 [00:00<00:00, 345.86 examples/s]
Filter (num_proc=8): 100%|##########| 130/130 [00:00<00:00, 538.15 examples/s]
INFO 2025-05-16 20:58:14,269 instructlab.sdg.pipeline:199: Running block: eval_relevancy_qa_pair
INFO 2025-05-16 20:58:17,000 instructlab.sdg.pipeline:199: Running block: filter_relevancy
Map (num_proc=8): 100%|##########| 54/54 [00:00<00:00, 151.56 examples/s]
Filter (num_proc=8): 100%|##########| 54/54 [00:00<00:00, 227.24 examples/s]
INFO 2025-05-16 20:58:18,259 instructlab.sdg.pipeline:199: Running block: eval_verify_question
INFO 2025-05-16 20:58:21,404 instructlab.sdg.pipeline:199: Running block: filter_verify_question
Map (num_proc=8): 100%|##########| 52/52 [00:00<00:00, 143.99 examples/s]
Filter (num_proc=8): 100%|##########| 52/52 [00:00<00:00, 218.14 examples/s]
INFO 2025-05-16 20:58:22,679 instructlab.sdg.checkpointing:44: Saving checkpoint to /var/home/cloud-user/.local/share/instructlab/datasets/checkpoints/knowledge_science_animals_birds_black_capped_chickadee/data_checkpoint_33316355ed9c4795a9e084691de0b72e.jsonl
Creating json from Arrow format: 100%|##########| 1/1 [00:00<00:00, 218.77ba/s]
INFO 2025-05-16 20:58:23,188 instructlab.sdg.generate_data:478: Generated 2599 samples
INFO 2025-05-16 20:58:23,215 instructlab.sdg.pipeline:161: Running pipeline with multi-threaded batching. Using 2 workers for batches of size 256
INFO 2025-05-16 20:58:23,219 instructlab.sdg.pipeline:199: Running block: gen_mmlu_knowledge
Filter: 100%|##########| 355/355 [00:00<00:00, 45323.81 examples/s]
Filter: 100%|##########| 355/355 [00:00<00:00, 25613.30 examples/s]
Flattening the indices: 100%|##########| 355/355 [00:00<00:00, 39431.63 examples/s]
Map: 100%|##########| 355/355 [00:00<00:00, 11058.79 examples/s]
Map: 100%|##########| 355/355 [00:00<00:00, 10256.15 examples/s]
Map: 100%|##########| 355/355 [00:00<00:00, 10301.07 examples/s]
Filter: 100%|##########| 355/355 [00:00<00:00, 39055.16 examples/s]
Filter: 100%|##########| 355/355 [00:00<00:00, 20605.55 examples/s]
Filter: 100%|##########| 353/353 [00:00<00:00, 20290.38 examples/s]
Flattening the indices: 100%|##########| 353/353 [00:00<00:00, 37776.88 examples/s]
Casting to class labels: 100%|##########| 353/353 [00:00<00:00, 10543.56 examples/s]
INFO 2025-05-16 20:58:38,381 instructlab.sdg.eval_data:126: Saving MMLU Dataset /var/home/cloud-user/.local/share/instructlab/datasets/2025-05-16_204950/node_datasets_2025-05-16T20_54_08/mmlubench_knowledge_arts_music_fandom_swifties.jsonl
Creating json from Arrow format: 100%|##########| 1/1 [00:00<00:00, 115.99ba/s]
INFO 2025-05-16 20:58:38,390 instructlab.sdg.eval_data:130: Saving MMLU Task yaml /var/home/cloud-user/.local/share/instructlab/datasets/2025-05-16_204950/node_datasets_2025-05-16T20_54_08/knowledge_arts_music_fandom_swifties_task.yaml
INFO 2025-05-16 20:58:38,400 instructlab.sdg.pipeline:161: Running pipeline with multi-threaded batching. Using 2 workers for batches of size 256
INFO 2025-05-16 20:58:38,404 instructlab.sdg.pipeline:199: Running block: gen_mmlu_knowledge
Filter: 100%|##########| 382/382 [00:00<00:00, 49746.15 examples/s]
Filter: 100%|##########| 382/382 [00:00<00:00, 26542.27 examples/s]
Flattening the indices: 100%|##########| 382/382 [00:00<00:00, 43644.25 examples/s]
Map: 100%|##########| 382/382 [00:00<00:00, 11010.79 examples/s]
Map: 100%|##########| 382/382 [00:00<00:00, 10253.45 examples/s]
Map: 100%|##########| 382/382 [00:00<00:00, 10340.53 examples/s]
Filter: 100%|##########| 382/382 [00:00<00:00, 39679.64 examples/s]
Filter: 100%|##########| 382/382 [00:00<00:00, 20491.42 examples/s]
Filter: 100%|##########| 375/375 [00:00<00:00, 20579.95 examples/s]
Flattening the indices: 100%|##########| 375/375 [00:00<00:00, 36612.29 examples/s]
Casting to class labels: 100%|##########| 375/375 [00:00<00:00, 10550.40 examples/s]
INFO 2025-05-16 20:58:54,991 instructlab.sdg.eval_data:126: Saving MMLU Dataset /var/home/cloud-user/.local/share/instructlab/datasets/2025-05-16_204950/node_datasets_2025-05-16T20_54_08/mmlubench_knowledge_science_animals_birds_black_capped_chickadee.jsonl
Creating json from Arrow format: 100%|##########| 1/1 [00:00<00:00, 108.44ba/s]
INFO 2025-05-16 20:58:55,001 instructlab.sdg.eval_data:130: Saving MMLU Task yaml /var/home/cloud-user/.local/share/instructlab/datasets/2025-05-16_204950/node_datasets_2025-05-16T20_54_08/knowledge_science_animals_birds_black_capped_chickadee_task.yaml
Map (num_proc=8): 100%|##########| 155/155 [00:00<00:00, 344.08 examples/s]
Creating json from Arrow format: 100%|##########| 1/1 [00:00<00:00, 69.68ba/s]
Map (num_proc=8): 100%|##########| 85/85 [00:00<00:00, 214.06 examples/s]
Creating json from Arrow format: 100%|##########| 1/1 [00:00<00:00, 125.58ba/s]
Map (num_proc=8): 100%|##########| 70/70 [00:00<00:00, 199.30 examples/s]
Creating json from Arrow format: 100%|##########| 1/1 [00:00<00:00, 210.94ba/s]
Map: 100%|##########| 3276/3276 [00:00<00:00, 8831.84 examples/s]
Map: 100%|##########| 3276/3276 [00:00<00:00, 32578.74 examples/s]
Filter: 100%|##########| 3276/3276 [00:00<00:00, 60491.57 examples/s]
Map: 100%|##########| 73/73 [00:00<00:00, 10435.01 examples/s]
Map: 100%|##########| 73/73 [00:00<00:00, 17914.94 examples/s]
Creating json from Arrow format: 100%|##########| 4/4 [00:00<00:00, 47.56ba/s]
Map: 100%|##########| 3276/3276 [00:00<00:00, 8871.92 examples/s]
Map: 100%|##########| 3276/3276 [00:00<00:00, 8734.35 examples/s]
Map: 100%|##########| 3276/3276 [00:00<00:00, 9000.04 examples/s]
Map: 100%|##########| 3276/3276 [00:00<00:00, 32770.42 examples/s]
Filter: 100%|##########| 3276/3276 [00:00<00:00, 60879.66 examples/s]
Map: 100%|##########| 73/73 [00:00<00:00, 10366.12 examples/s]
INFO 2025-05-16 20:59:11,016 instructlab.sdg.datamixing:774: Knowledge detected to be less than 3.00% of skills (1.68%), upsampling to: 11824
Creating json from Arrow format: 100%|##########| 7/7 [00:00<00:00, 25.63ba/s]
Map: 100%|##########| 2599/2599 [00:00<00:00, 8879.17 examples/s]
Map: 100%|##########| 2599/2599 [00:00<00:00, 32562.45 examples/s]
Filter: 100%|##########| 2599/2599 [00:00<00:00, 59478.58 examples/s]
Map: 100%|##########| 66/66 [00:00<00:00, 10450.92 examples/s]
Map: 100%|##########| 66/66 [00:00<00:00, 17161.00 examples/s]
Creating json from Arrow format: 100%|##########| 3/3 [00:00<00:00, 45.44ba/s]
Map: 100%|##########| 2599/2599 [00:00<00:00, 8889.64 examples/s]
Map: 100%|##########| 2599/2599 [00:00<00:00, 9018.61 examples/s]
Map: 100%|##########| 2599/2599 [00:00<00:00, 8960.69 examples/s]
Map: 100%|##########| 2599/2599 [00:00<00:00, 32301.45 examples/s]
Filter: 100%|##########| 2599/2599 [00:00<00:00, 59526.97 examples/s]
Map: 100%|##########| 66/66 [00:00<00:00, 10346.24 examples/s]
INFO 2025-05-16 20:59:13,015 instructlab.sdg.datamixing:774: Knowledge detected to be less than 3.00% of skills (1.34%), upsampling to: 11824
Creating json from Arrow format: 100%|##########| 6/6 [00:00<00:00, 29.20ba/s]
INFO 2025-05-16 20:59:14,023 instructlab.sdg.datamixing:158: Loading dataset from /usr/share/instructlab/sdg/datasets/skills.jsonl ...
Generating train split: 301205 examples [02:03, 2432.96 examples/s] 
INFO 2025-05-16 21:01:23,091 instructlab.model.backends.vllm:512: Waiting for GPU VRAM reclamation...
failed to generate data with exception: An error occurred while generating the dataset

real	11m46.468s
user	0m0.351s
sys	0m0.239s
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ find . | grep yaml
./.config/instructlab/config.yaml.lock
./.config/instructlab/config.yaml
./.local/share/instructlab/datasets/2025-05-16_191556/node_datasets_2025-05-16T19_18_39/knowledge_arts_music_fandom_swifties_task.yaml
./.local/share/instructlab/datasets/2025-05-16_191556/node_datasets_2025-05-16T19_18_39/knowledge_science_animals_birds_black_capped_chickadee_task.yaml
./.local/share/instructlab/datasets/2025-05-16_191556/knowledge_recipe_2025-05-16T19_18_39.yaml
./.local/share/instructlab/datasets/2025-05-16_191556/skills_recipe_2025-05-16T19_18_39.yaml
./.local/share/instructlab/datasets/2025-05-16_204950/node_datasets_2025-05-16T20_54_08/knowledge_arts_music_fandom_swifties_task.yaml
./.local/share/instructlab/datasets/2025-05-16_204950/node_datasets_2025-05-16T20_54_08/knowledge_science_animals_birds_black_capped_chickadee_task.yaml
./.local/share/instructlab/datasets/2025-05-16_204950/knowledge_recipe_2025-05-16T20_54_08.yaml
./.local/share/instructlab/datasets/2025-05-16_204950/skills_recipe_2025-05-16T20_54_08.yaml
./.local/share/instructlab/internal/train_configuration/additional/additional_args.yaml
./.local/share/instructlab/internal/system_profiles/amd/mi300x/mi300x_x4.yaml
./.local/share/instructlab/internal/system_profiles/amd/mi300x/mi300x_x2.yaml
./.local/share/instructlab/internal/system_profiles/amd/mi300x/mi300x_x8.yaml
./.local/share/instructlab/taxonomy/.markdownlint-cli2.yaml
./.local/share/instructlab/taxonomy/compositional_skills/grounded/linguistics/inclusion/qna.yaml
./.local/share/instructlab/taxonomy/compositional_skills/grounded/linguistics/writing/rewriting/qna.yaml
./.local/share/instructlab/taxonomy/compositional_skills/linguistics/synonyms/qna.yaml
./.local/share/instructlab/taxonomy/docs/template_qna.yaml
./.local/share/instructlab/taxonomy/foundational_skills/reasoning/common_sense_reasoning/qna.yaml
./.local/share/instructlab/taxonomy/foundational_skills/reasoning/linguistics_reasoning/logical_sequence_of_words/qna.yaml
./.local/share/instructlab/taxonomy/foundational_skills/reasoning/linguistics_reasoning/object_identification/qna.yaml
./.local/share/instructlab/taxonomy/foundational_skills/reasoning/linguistics_reasoning/odd_one_out/qna.yaml
./.local/share/instructlab/taxonomy/foundational_skills/reasoning/logical_reasoning/causal/qna.yaml
./.local/share/instructlab/taxonomy/foundational_skills/reasoning/logical_reasoning/general/qna.yaml
./.local/share/instructlab/taxonomy/foundational_skills/reasoning/logical_reasoning/tabular/qna.yaml
./.local/share/instructlab/taxonomy/foundational_skills/reasoning/mathematical_reasoning/qna.yaml
./.local/share/instructlab/taxonomy/foundational_skills/reasoning/temporal_reasoning/qna.yaml
./.local/share/instructlab/taxonomy/foundational_skills/reasoning/theory_of_mind/qna.yaml
./.local/share/instructlab/taxonomy/foundational_skills/reasoning/unconventional_reasoning/lower_score_wins/qna.yaml
./.local/share/instructlab/taxonomy/knowledge/arts/music/fandom/swifties/qna.yaml
./.local/share/instructlab/taxonomy/knowledge/science/animals/birds/black_capped_chickadee/qna.yaml
./.local/share/instructlab/taxonomy/scripts/check-yaml.py
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ cat df 0^C
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ df -hl
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        4.0M     0  4.0M   0% /dev
tmpfs           882G  168K  882G   1% /dev/shm
tmpfs           353G  2.6M  353G   1% /run
/dev/vda4       249G  244G  5.1G  98% /sysroot
overlay          39M   39M     0 100% /
tmpfs           882G   20K  882G   1% /tmp
/dev/vda3       960M  101M  860M  11% /boot
/dev/vda2       501M  7.1M  494M   2% /boot/efi
tmpfs           177G   28K  177G   1% /run/user/1000
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ time ilab data generate | tee iso-testrun/ilab-data-generate-8-gpufix2
INFO 2025-05-16 21:05:45,352 instructlab.process.process:300: Started subprocess with PID 1. Logs are being written to /var/home/cloud-user/.local/share/instructlab/logs/generation/generation-88b6617e-3299-11f0-9046-0200048919a9.log.
INFO 2025-05-16 21:05:49,226 instructlab.model.backends.vllm:115: Trying to connect to model server at http://127.0.0.1:8000/v1
INFO 2025-05-16 21:05:50,652 instructlab.model.backends.vllm:332: vLLM starting up on pid 5 at http://127.0.0.1:43777/v1
INFO 2025-05-16 21:05:50,652 instructlab.model.backends.vllm:123: Starting a temporary vLLM server at http://127.0.0.1:43777/v1
INFO 2025-05-16 21:05:50,652 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43777/v1, this might take a moment... Attempt: 1/120
INFO 2025-05-16 21:05:54,070 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43777/v1, this might take a moment... Attempt: 2/120
INFO 2025-05-16 21:05:57,480 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43777/v1, this might take a moment... Attempt: 3/120
INFO 2025-05-16 21:06:00,935 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43777/v1, this might take a moment... Attempt: 4/120
INFO 2025-05-16 21:06:04,421 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43777/v1, this might take a moment... Attempt: 5/120
INFO 2025-05-16 21:06:07,764 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43777/v1, this might take a moment... Attempt: 6/120
INFO 2025-05-16 21:06:10,983 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43777/v1, this might take a moment... Attempt: 7/120
INFO 2025-05-16 21:06:14,240 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43777/v1, this might take a moment... Attempt: 8/120
INFO 2025-05-16 21:06:17,683 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43777/v1, this might take a moment... Attempt: 9/120
INFO 2025-05-16 21:06:21,028 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43777/v1, this might take a moment... Attempt: 10/120
INFO 2025-05-16 21:06:24,291 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43777/v1, this might take a moment... Attempt: 11/120
INFO 2025-05-16 21:06:27,647 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43777/v1, this might take a moment... Attempt: 12/120
INFO 2025-05-16 21:06:30,916 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43777/v1, this might take a moment... Attempt: 13/120
INFO 2025-05-16 21:06:34,333 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43777/v1, this might take a moment... Attempt: 14/120
INFO 2025-05-16 21:06:37,756 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43777/v1, this might take a moment... Attempt: 15/120
INFO 2025-05-16 21:06:41,091 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43777/v1, this might take a moment... Attempt: 16/120
INFO 2025-05-16 21:06:44,487 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43777/v1, this might take a moment... Attempt: 17/120
INFO 2025-05-16 21:06:47,834 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:43777/v1, this might take a moment... Attempt: 18/120
^C
Aborted!
^C
real	1m7.328s
user	0m0.095s
sys	0m0.066s
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ sudo mkfs.xfs -L ilab-data /dev^C
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ ps -ef | grep ilab
cloud-u+   49975   31553  0 21:43 pts/0    00:00:00 grep --color=auto ilab
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ df -hl
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        4.0M     0  4.0M   0% /dev
tmpfs           882G  168K  882G   1% /dev/shm
tmpfs           353G  2.6M  353G   1% /run
/dev/vda4       249G   39G  211G  16% /sysroot
overlay          39M   39M     0 100% /
tmpfs           882G   24K  882G   1% /tmp
/dev/vda3       960M  101M  860M  11% /boot
/dev/vda2       501M  7.1M  494M   2% /boot/efi
tmpfs           177G  3.3M  177G   1% /run/user/1000
/dev/nvme1n1    3.0T  227G  2.7T   8% /var/home/cloud-user/.cache
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ cat /etc/profile
# /etc/profile

# System wide environment and startup programs, for login setup
# Functions and aliases go in /etc/bashrc

# It's NOT a good idea to change this file unless you know what you
# are doing. It's much better to create a custom.sh shell script in
# /etc/profile.d/ to make custom changes to your environment, as this
# will prevent the need for merging in future updates.

pathmunge () {
    case ":${PATH}:" in
        *:"$1":*)
            ;;
        *)
            if [ "$2" = "after" ] ; then
                PATH=$PATH:$1
            else
                PATH=$1:$PATH
            fi
    esac
}


if [ -x /usr/bin/id ]; then
    if [ -z "$EUID" ]; then
        # ksh workaround
        EUID=`/usr/bin/id -u`
        UID=`/usr/bin/id -ru`
    fi
    USER="`/usr/bin/id -un`"
    LOGNAME=$USER
    MAIL="/var/spool/mail/$USER"
fi

# Path manipulation
if [ "$EUID" = "0" ]; then
    pathmunge /usr/sbin
    pathmunge /usr/local/sbin
else
    pathmunge /usr/local/sbin after
    pathmunge /usr/sbin after
fi

HOSTNAME=$(/usr/bin/hostnamectl --transient 2>/dev/null) || \
HOSTNAME=$(/usr/bin/hostname 2>/dev/null) || \
HOSTNAME=$(/usr/bin/uname -n)

HISTSIZE=1000
if [ "$HISTCONTROL" = "ignorespace" ] ; then
    export HISTCONTROL=ignoreboth
else
    export HISTCONTROL=ignoredups
fi

export PATH USER LOGNAME MAIL HOSTNAME HISTSIZE HISTCONTROL

for i in /etc/profile.d/*.sh /etc/profile.d/sh.local ; do
    if [ -r "$i" ]; then
        if [ "${-#*i}" != "$-" ]; then 
            . "$i"
        else
            . "$i" >/dev/null
        fi
    fi
done

unset i
unset -f pathmunge

if [ -n "${BASH_VERSION-}" ] ; then
        if [ -f /etc/bashrc ] ; then
                # Bash login shells run only /etc/profile
                # Bash non-login shells run only /etc/bashrc
                # Check for double sourcing is done in /etc/bashrc.
                . /etc/bashrc
       fi
fi
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ umask
0022
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ sudo vim /etc/dnf/
aliases.d/          dnf.conf            modules.d/          modules.defaults.d/ plugins/            protected.d/        vars/               
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ sudo vim /etc/dnf/
aliases.d/          dnf.conf            modules.d/          modules.defaults.d/ plugins/            protected.d/        vars/               
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ sudo vim /etc/dnf/dnf.conf 
sudo: vim: command not found
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ sudo vi /etc/dnf/dnf.conf 
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ ls -latr
total 28
-rw-r--r--.  1 cloud-user cloud-user  492 May 16 05:38 .bashrc
-rw-r--r--.  1 cloud-user cloud-user   18 May 16 05:38 .bash_logout
drwxr-xr-x.  3 root       root         24 May 16 17:37 ..
drwx------.  2 cloud-user cloud-user   29 May 16 17:37 .ssh
drwxr-xr-x.  3 cloud-user cloud-user   19 May 16 19:04 .triton
drwxr-xr-x.  6 cloud-user cloud-user   66 May 16 19:07 .config
drwx------.  3 cloud-user cloud-user   43 May 16 19:13 .local
drwxr-xr-x.  2 cloud-user cloud-user  151 May 16 21:05 iso-testrun-orig
-rw-r--r--.  1 cloud-user cloud-user 2368 May 16 21:16 EL_AI_test_1.5.sh
drwxr-xr-x.  5 cloud-user cloud-user   58 May 16 21:42 .cache
-rw-r--r--.  1 cloud-user cloud-user  141 May 16 21:42 .bash_profile
-rw-------.  1 cloud-user cloud-user 6657 May 16 21:42 .bash_history
drwxr-xr-x.  2 cloud-user cloud-user   80 May 16 21:47 iso-testrun.surprisingly-quick-because-data-existed
drwx------. 10 cloud-user cloud-user 4096 May 16 22:07 .
drwxr-xr-x.  2 cloud-user cloud-user   80 May 16 22:09 iso-testrun
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ sudo dnf repolist
Updating Subscription Management repositories.
Repository fast-datapath-for-rhel-9-x86_64-rpms is listed more than once in the configuration
repo id                                                                                       repo name
codeready-builder-for-rhel-9-x86_64-eus-rpms                                                  Red Hat CodeReady Linux Builder for RHEL 9 x86_64 - Extended Update Support (RPMs)
rhel-9-for-x86_64-appstream-eus-rpms                                                          Red Hat Enterprise Linux 9 for x86_64 - AppStream - Extended Update Support (RPMs)
rhel-9-for-x86_64-appstream-rpms                                                              Red Hat Enterprise Linux 9 for x86_64 - AppStream (RPMs)
rhel-9-for-x86_64-baseos-eus-rpms                                                             Red Hat Enterprise Linux 9 for x86_64 - BaseOS - Extended Update Support (RPMs)
rhel-9-for-x86_64-baseos-rpms                                                                 Red Hat Enterprise Linux 9 for x86_64 - BaseOS (RPMs)
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ man rhc
-bash: man: command not found
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ rhc --help
NAME:
   rhc - control the system's connection to Red Hat

USAGE:
   rhc [global options] command [command options] [arguments...]

VERSION:
   0.2.4

DESCRIPTION:
   The rhc command controls the system's connection to Red Hat.
   
   To connect the system using an activation key:
     rhc connect --organization ID --activation-key KEY
   
   To connect the system using a username and password:
     rhc connect --username USERNAME --password PASSWORD
   
   To disconnect the system:
     rhc disconnect
   
   Run 'rhc command --help' for more details.

COMMANDS:
   connect     Connects the system to Red Hat
   disconnect  Disconnects the system from Red Hat
   status      Prints status of the system's connection to Red Hat
   help, h     Shows a list of commands or help for one command

GLOBAL OPTIONS:
   --no-color     (default: false) [$NO_COLOR]
   --help, -h     show help (default: false)
   --version, -v  print the version (default: false)
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ rhc connect --help
NAME:
   rhc connect - Connects the system to Red Hat

USAGE:
   rhc connect [command options]

DESCRIPTION:
   The connect command connects the system to Red Hat Subscription Management, Red Hat Insights and Red Hat and activates the Remote Host Configuration daemon that enables Red Hat to interact with the system. For details visit: https://red.ht/connector

OPTIONS:
   --username USERNAME, -u USERNAME  register with USERNAME
   --password PASSWORD, -p PASSWORD  register with PASSWORD
   --organization ID, -o ID          register with ID
   --activation-key KEY, -a KEY      register with KEY
   --help, -h                        show help (default: false)
   
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ df -hl
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        4.0M     0  4.0M   0% /dev
tmpfs           882G  168K  882G   1% /dev/shm
tmpfs           353G  2.6M  353G   1% /run
/dev/vda4       249G   38G  211G  16% /sysroot
overlay          39M   39M     0 100% /
tmpfs           882G   24K  882G   1% /tmp
/dev/vda3       960M  101M  860M  11% /boot
/dev/vda2       501M  7.1M  494M   2% /boot/efi
tmpfs           177G  3.3M  177G   1% /run/user/1000
/dev/nvme1n1    3.0T  247G  2.7T   9% /var/home/cloud-user/.cache
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ du -sh .data
du: cannot access '.data': No such file or directory
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ sudo du -sh .local
204M	.local
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ sudo du -sh .local
204M	.local
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ sudo du -sh .*
226G	.
0	..
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ sudo du -sh ls^C
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ ls
EL_AI_test_1.5.sh  iso-testrun  iso-testrun-orig  iso-testrun.surprisingly-quick-because-data-existed
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ ls -latrh
total 28K
-rw-r--r--.  1 cloud-user cloud-user  492 May 16 05:38 .bashrc
-rw-r--r--.  1 cloud-user cloud-user   18 May 16 05:38 .bash_logout
drwxr-xr-x.  3 root       root         24 May 16 17:37 ..
drwx------.  2 cloud-user cloud-user   29 May 16 17:37 .ssh
drwxr-xr-x.  3 cloud-user cloud-user   19 May 16 19:04 .triton
drwxr-xr-x.  6 cloud-user cloud-user   66 May 16 19:07 .config
drwx------.  3 cloud-user cloud-user   43 May 16 19:13 .local
drwxr-xr-x.  2 cloud-user cloud-user  151 May 16 21:05 iso-testrun-orig
-rw-r--r--.  1 cloud-user cloud-user 2.4K May 16 21:16 EL_AI_test_1.5.sh
drwxr-xr-x.  5 cloud-user cloud-user   58 May 16 21:42 .cache
-rw-r--r--.  1 cloud-user cloud-user  141 May 16 21:42 .bash_profile
-rw-------.  1 cloud-user cloud-user 6.6K May 16 21:42 .bash_history
drwxr-xr-x.  2 cloud-user cloud-user   80 May 16 21:47 iso-testrun.surprisingly-quick-because-data-existed
drwx------. 10 cloud-user cloud-user 4.0K May 16 22:07 .
drwxr-xr-x.  2 cloud-user cloud-user   80 May 16 22:09 iso-testrun
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ ps -ef | grep ilab
cloud-u+   56428   55215  0 22:09 pts/1    00:00:02 podman run --rm -it --device /dev/kfd --device /dev/dri --security-opt label=disable --net host --shm-size 10G --pids-limit -1 -v /var/home/cloud-user:/var/home/cloud-user -v /run/user/1000/containers/auth.json:/run/containers/0/auth.json --env HF_TOKEN --env HOME --env NCCL_DEBUG --env VLLM_LOGGING_LEVEL --entrypoint ilab registry.redhat.io/rhelai1/instructlab-amd-rhel9:1.5.0 data generate
cloud-u+   56429   55215  0 22:09 pts/1    00:00:00 tee iso-testrun/ilab-data-generate
cloud-u+   56463   56461  3 22:09 pts/0    00:01:54 /opt/app-root/bin/python3.11 /opt/app-root/bin/ilab data generate
cloud-u+   69663   56463 23 23:10 pts/0    00:00:03 /opt/app-root/bin/python3.11 /opt/app-root/bin/ilab data generate
cloud-u+   69664   56463 23 23:10 pts/0    00:00:03 /opt/app-root/bin/python3.11 /opt/app-root/bin/ilab data generate
cloud-u+   69665   56463 23 23:10 pts/0    00:00:03 /opt/app-root/bin/python3.11 /opt/app-root/bin/ilab data generate
cloud-u+   69666   56463 23 23:10 pts/0    00:00:03 /opt/app-root/bin/python3.11 /opt/app-root/bin/ilab data generate
cloud-u+   69667   56463 23 23:10 pts/0    00:00:03 /opt/app-root/bin/python3.11 /opt/app-root/bin/ilab data generate
cloud-u+   69668   56463 23 23:10 pts/0    00:00:03 /opt/app-root/bin/python3.11 /opt/app-root/bin/ilab data generate
cloud-u+   69670   56463 23 23:10 pts/0    00:00:03 /opt/app-root/bin/python3.11 /opt/app-root/bin/ilab data generate
cloud-u+   69672   56463 23 23:10 pts/0    00:00:03 /opt/app-root/bin/python3.11 /opt/app-root/bin/ilab data generate
cloud-u+   69676   56463  1 23:10 pts/0    00:00:00 /opt/app-root/bin/python3.11 /opt/app-root/bin/ilab data generate
cloud-u+   69699   31553  0 23:10 pts/0    00:00:00 grep --color=auto ilab
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ sudo stat ~/is^C
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ stat iso-testrun/ilab-data-generate 
  File: iso-testrun/ilab-data-generate
  Size: 117856    	Blocks: 256        IO Block: 4096   regular file
Device: fc04h/64516d	Inode: 335545603   Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/cloud-user)   Gid: ( 1000/cloud-user)
Context: unconfined_u:object_r:user_home_t:s0
Access: 2025-05-16 22:09:09.208850483 +0000
Modify: 2025-05-16 23:10:32.803864226 +0000
Change: 2025-05-16 23:10:32.803864226 +0000
 Birth: 2025-05-16 22:09:09.208850483 +0000
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ du -sh .local
du: cannot read directory '.local/share/containers/storage/overlay/98c8240177bd8a6b73d5905378898ac7041bc83b78fd507d4fe48e57440e5b00/work': Permission denied
du: cannot read directory '.local/share/containers/storage/overlay/98c8240177bd8a6b73d5905378898ac7041bc83b78fd507d4fe48e57440e5b00/merged': Permission denied
du: cannot read directory '.local/share/containers/storage/overlay/9cb605b6f1f7b077e2748d722186c245084395342abf99203bbf9714b74e70e8/work': Permission denied
du: cannot read directory '.local/share/containers/storage/overlay/9cb605b6f1f7b077e2748d722186c245084395342abf99203bbf9714b74e70e8/merged': Permission denied
du: cannot read directory '.local/share/containers/storage/overlay/6eedafc59ab17fa9036bd7425e11d439d4c4781f090725357f3f92468cdf0eee/work': Permission denied
du: cannot read directory '.local/share/containers/storage/overlay/6eedafc59ab17fa9036bd7425e11d439d4c4781f090725357f3f92468cdf0eee/merged': Permission denied
du: cannot read directory '.local/share/containers/storage/overlay/3ca1c90b09f45d5bbc933f7dd8f193baa31be5efa46f4308d3d71193f3304a6e/work': Permission denied
du: cannot read directory '.local/share/containers/storage/overlay/3ca1c90b09f45d5bbc933f7dd8f193baa31be5efa46f4308d3d71193f3304a6e/merged': Permission denied
du: cannot read directory '.local/share/containers/storage/overlay/5098d9f69b7a3c6216a0e0afe8e2de94e7528371a3b9eaf998e9a50f5eb9ac54/work': Permission denied
du: cannot read directory '.local/share/containers/storage/overlay/5098d9f69b7a3c6216a0e0afe8e2de94e7528371a3b9eaf998e9a50f5eb9ac54/merged': Permission denied
7.5G	.local
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ sudo du -sh .local
7.5G	.local
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ sudo find .loca^C
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ cd .lo^C
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ sudo du -sh .local
7.6G	.local
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ nvtop
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ nvtop
[cloud-user@mdepaulo-v15-7-prod-amd ~]$ Read from remote host 169.63.187.52: Connection timed out
Connection to 169.63.187.52 closed.
client_loop: send disconnect: Broken pipe
mdepaulo@mdepaulo-thinkpadx1nanogen2:~/rhelai/ecosystem-rhel-ai$ ssh mikedep333-ibm-us-east
ssh: connect to host 169.63.187.52 port 22: Connection timed out
mdepaulo@mdepaulo-thinkpadx1nanogen2:~/rhelai/ecosystem-rhel-ai$ ssh mikedep333-ibm-us-east
^C
mdepaulo@mdepaulo-thinkpadx1nanogen2:~/rhelai/ecosystem-rhel-ai$ ssh 169.63.187.52      

ssh: connect to host 169.63.187.52 port 22: Connection timed out
mdepaulo@mdepaulo-thinkpadx1nanogen2:~/rhelai/ecosystem-rhel-ai$ ssh 169.63.187.52      
ssh: connect to host 169.63.187.52 port 22: Connection timed out
mdepaulo@mdepaulo-thinkpadx1nanogen2:~/rhelai/ecosystem-rhel-ai$ sh 150.240.3.148

sh: 150.240.3.148: No such file or directory
mdepaulo@mdepaulo-thinkpadx1nanogen2:~/rhelai/ecosystem-rhel-ai$ sh 150.240.3.148
sh: 150.240.3.148: No such file or directory
mdepaulo@mdepaulo-thinkpadx1nanogen2:~/rhelai/ecosystem-rhel-ai$ ssh 150.240.3.148
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ED25519 key sent by the remote host is
SHA256:aFYfaC7m+szooyVrhZ+3GaEyltY/ETUfkzPUmHePPTM.
Please contact your system administrator.
Add correct host key in /home/mdepaulo/.ssh/known_hosts to get rid of this message.
Offending ECDSA key in /home/mdepaulo/.ssh/known_hosts:246
Host key for 150.240.3.148 has changed and you have requested strict checking.
Host key verification failed.
mdepaulo@mdepaulo-thinkpadx1nanogen2:~/rhelai/ecosystem-rhel-ai$ ssh-keygen -R 150.240.3.148
# Host 150.240.3.148 found: line 244
# Host 150.240.3.148 found: line 245
# Host 150.240.3.148 found: line 246
/home/mdepaulo/.ssh/known_hosts updated.
Original contents retained as /home/mdepaulo/.ssh/known_hosts.old
mdepaulo@mdepaulo-thinkpadx1nanogen2:~/rhelai/ecosystem-rhel-ai$ ssh 150.240.3.148
The authenticity of host '150.240.3.148 (150.240.3.148)' can't be established.
ED25519 key fingerprint is SHA256:aFYfaC7m+szooyVrhZ+3GaEyltY/ETUfkzPUmHePPTM.
This key is not known by any other names.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '150.240.3.148' (ED25519) to the list of known hosts.
mdepaulo@150.240.3.148: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
mdepaulo@mdepaulo-thinkpadx1nanogen2:~/rhelai/ecosystem-rhel-ai$ ssh cloud-user2@150.240.3.148
cloud-user2@150.240.3.148: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
mdepaulo@mdepaulo-thinkpadx1nanogen2:~/rhelai/ecosystem-rhel-ai$ ssh cloud-user@150.240.3.148
Register this system with Red Hat Insights: insights-client --register
Create an account or view all your systems at https://red.ht/insights-dashboard
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ cd .config/containers/^C
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ ln -s /var/run/u
udev/    udisks2/ user/    utmp     
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ ln -s /var/run/u
udev/    udisks2/ user/    utmp     
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ ln -s /var/run/user/1000/
bus      systemd/ 
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ ln -s /var/run/user/1000/
bus      systemd/ 
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ ln -s /var/run/user/1000/
bus      systemd/ 
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ ln -s /var/run/user/1000/co^C
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ sudo mkdir /var/run/user/1000/containers
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ ln -s^C
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ sudo rmdir /var/run/user/1000/containers
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ mkdir /var/run/user/1000/containers
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ ln -s ^C
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ sudo cp ^C
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ cp ~/.config/containers/auth.json /var/run/user/1000/containers/
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ vim EL_AI_test_1.5.sh 
-bash: vim: command not found
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ vi EL_AI_test_1.5.sh 
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ bash EL_AI_test_1.5.sh 
+ podman login registry.redhat.io
Authenticating with existing credentials for registry.redhat.io
Existing credentials are valid. Already logged in to registry.redhat.io
+ ilab --version
This host is not connected to Red Hat Insights.

To connect this host to Red Hat Insights run the following command:
sudo rhc connect --organization <org_id> --activation-key <your_activation_key>

To generate an Activation Key:
https://console.redhat.com/insights/connector/activation-keys (this page will also display your Organization ID).

For more information on Red Hat Insights, please visit:
https://docs.redhat.com/en/documentation/subscription_central/1-latest/html/getting_started_with_activation_keys_on_the_hybrid_cloud_console/assembly-creating-managing-activation-keys
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ rhc connect --organization 11009103 --activation-key mdepaulo-rhelai-qe
error: non-root user cannot connect system
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ sudo rhc connect --organization 11009103 --activation-key mdepaulo-rhelai-qe
Connecting mdepaulo-v157-amd-prod-2 to Red Hat.
This might take a few seconds.

● Connected to Red Hat Subscription Management
● Connected to Red Hat Insights
● Activated the Remote Host Configuration daemon
● Enabled console.redhat.com services: remote configuration, insights, remediations, compliance

Successfully connected to Red Hat!

Manage your connected systems: https://red.ht/connector
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ bash EL_AI_test_1.5.sh 
+ podman login registry.redhat.io
Authenticating with existing credentials for registry.redhat.io
Existing credentials are valid. Already logged in to registry.redhat.io
+ ilab --version
ilab, version 0.26.1
+ sudo cp /run/user/1000/containers/auth.json /etc/ostree/
+ mkdir iso-testrun
+ ilab config init

----------------------------------------------------
         Welcome to the InstructLab CLI
  This guide will help you to setup your environment
----------------------------------------------------

Please provide the following values to initiate the environment [press 'Enter' for default options when prompted]
Cloning https://github.com/instructlab/taxonomy.git...

Generating config file:
    /var/home/cloud-user/.config/instructlab/config.yaml

Please choose a system profile.
Profiles set hardware-specific defaults for all commands and sections of the configuration.
First, please select the hardware vendor your system falls into
[0] NO SYSTEM PROFILE
[1] AMD
Enter the number of your choice [0]: 1
You selected: AMD
Next, please select the specific hardware configuration that most closely matches your system.
[0] NO SYSTEM PROFILE
[1] AMD MI300X X4
[2] AMD MI300X X2
[3] AMD MI300X X8
Enter the number of your choice [hit enter for hardware defaults] [0]: 3
You selected: /var/home/cloud-user/.local/share/instructlab/internal/system_profiles/amd/mi300x/mi300x_x8.yaml

--------------------------------------------
    Initialization completed successfully!
  You're ready to start using `ilab`. Enjoy!
--------------------------------------------
+ sed -i 's/gpus: 1/gpus: 8/g' /var/home/cloud-user/.config/instructlab/config.yaml
+ ilab config show
+ ilab system info
+ ilab model download --repository docker://registry.stage.redhat.io/rhelai1/skills-adapter-v3 --release 1.5
INFO 2025-05-19 14:14:32,583 instructlab.model.download:192: Downloading model from OCI registry:
    Model: docker://registry.stage.redhat.io/rhelai1/skills-adapter-v3@1.5
    Destination: /var/home/cloud-user/.cache/instructlab/models
Copying blob 4452b845ab9c done   | 
Copying blob 01f47425d010 done   | 
Copying blob cfc7749b96f6 done   | 
Copying blob cd99f66c98e5 done   | 
Copying blob 6f4761a5ce47 done   | 
Copying blob 488e082ff0d1 done   | 
Copying blob d8d4489231c6 done   | 
Copying blob 5d44fdf2d36d done   | 
Copying config 44136fa355 done   | 
Writing manifest to image destination
INFO 2025-05-19 14:14:42,310 instructlab.model.download:288: 
ᕦ(òᴗóˇ)ᕤ docker://registry.stage.redhat.io/rhelai1/skills-adapter-v3 model download completed successfully! ᕦ(òᴗóˇ)ᕤ

INFO 2025-05-19 14:14:42,310 instructlab.model.download:302: Available models (`ilab model list`):
+------------+---------------+------+---------------+
| Model Name | Last Modified | Size | Absolute path |
+------------+---------------+------+---------------+
+------------+---------------+------+---------------+
+ ilab model download --repository docker://registry.stage.redhat.io/rhelai1/knowledge-adapter-v3 --release 1.5
INFO 2025-05-19 14:14:48,845 instructlab.model.download:192: Downloading model from OCI registry:
    Model: docker://registry.stage.redhat.io/rhelai1/knowledge-adapter-v3@1.5
    Destination: /var/home/cloud-user/.cache/instructlab/models
Copying blob e84e60569620 done   | 
Copying blob 4d0d6bb4d9d0 done   | 
Copying blob 82d96d7a9e6c done   | 
Copying blob c4334cbcdf17 done   | 
Copying blob 488e082ff0d1 done   | 
Copying blob cfc7749b96f6 done   | 
Copying blob 0f17dc4a3b97 done   | 
Copying blob d2313c03a149 done   | 
Copying config 44136fa355 done   | 
Writing manifest to image destination
INFO 2025-05-19 14:14:54,821 instructlab.model.download:288: 
ᕦ(òᴗóˇ)ᕤ docker://registry.stage.redhat.io/rhelai1/knowledge-adapter-v3 model download completed successfully! ᕦ(òᴗóˇ)ᕤ

INFO 2025-05-19 14:14:54,821 instructlab.model.download:302: Available models (`ilab model list`):
+------------+---------------+------+---------------+
| Model Name | Last Modified | Size | Absolute path |
+------------+---------------+------+---------------+
+------------+---------------+------+---------------+
+ ilab model download --repository docker://registry.stage.redhat.io/rhelai1/granite-3.1-8b-lab-v2 --release 1.5
INFO 2025-05-19 14:15:01,263 instructlab.model.download:192: Downloading model from OCI registry:
    Model: docker://registry.stage.redhat.io/rhelai1/granite-3.1-8b-lab-v2@1.5
    Destination: /var/home/cloud-user/.cache/instructlab/models
Copying blob db47a10e7df0 [=========>----------------------------] 1.2GiB / 4.6GiB | 754.3 MiB/s
Copying blob 303127a244b0 done   | 
Copying blob 081bdeaf76fc done   | 
Copying blob 3f4905316ed0 done   | 
Copying blob 694fca9fdcbf [=========>----------------------------] 1.2GiB / 4.6GiB | 541.4 MiB/s
Copying blob cfeca4972fa9 [=========>----------------------------] 1.2GiB / 4.6GiB | 544.5 MiB/s
Copying blob 6ff9f3935185 [=========================>------------] 1.1GiB / 1.7GiB | 416.2 MiB/s
Copying blob ac7915d4e17b done   | 
Copying blob 05db54e4d322 done   | 
Copying blob ed944c0b3d71 done   | 
Copying blob 66a07f75fd8d done   | 
Copying blob 80ab859339a2 done   | 
^C
Aborted!
^C^C^C^C^C^C^C^C^C^C
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
loop0     7:0    0  38.5M  1 loop 
zram0   251:0    0     8G  0 disk [SWAP]
vda     252:0    0   250G  0 disk 
├─vda1  252:1    0     1M  0 part 
├─vda2  252:2    0   501M  0 part /boot/efi
├─vda3  252:3    0     1G  0 part /boot
└─vda4  252:4    0 248.5G  0 part /var
                                  /sysroot/ostree/deploy/default/var
                                  /etc
                                  /sysroot
vdb     252:16   0   366K  0 disk 
vdc     252:32   0    44K  0 disk 
vdd     252:48   0  1000G  0 disk 
nvme7n1 259:0    0   2.9T  0 disk 
nvme6n1 259:1    0   2.9T  0 disk 
nvme2n1 259:2    0   2.9T  0 disk 
nvme0n1 259:3    0   2.9T  0 disk 
nvme1n1 259:4    0   2.9T  0 disk 
nvme3n1 259:5    0   2.9T  0 disk 
nvme4n1 259:6    0   2.9T  0 disk 
nvme5n1 259:7    0   2.9T  0 disk 
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ sudo mkdir^C
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ ls -latr ^C
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ sudo sgdisk -n 1:0:0 /dev/vdd
Creating new GPT entries in memory.
The operation has completed successfully.
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ mkfs.xfs -L ilab-data /dev/vdd
mkfs.xfs: cannot open /dev/vdd: Permission denied
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ sudo mkfs.xfs -L ilab-data /dev/vdd
mkfs.xfs: /dev/vdd appears to contain a partition table (gpt).
mkfs.xfs: Use the -f option to force overwrite.
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ sudo mkfs.xfs -L ilab-data /dev/vdd1
meta-data=/dev/vdd1              isize=512    agcount=4, agsize=65535935 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=0
data     =                       bsize=4096   blocks=262143739, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=127999, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
Discarding blocks...Done.
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ echo LABEL=ilab-data /mnt xfs defaults 0 0 | sudo tee -a /etc/fstab
LABEL=ilab-data /mnt xfs defaults 0 0
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ sudo systemctl daemon-reload
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ sudo mount -a
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ sudo chmod 1777 /mnt
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ mkdir -p /mnt/.config/containers
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ cp ~/.config/containers/
auth.json     storage.conf  
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ cp ~/.config/containers/storage.conf /mnt/.config/containers
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ vim ~/.bash_profile 
-bash: vim: command not found
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ :$
-bash: :$: command not found
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ vi ~/.bash_profile 
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ vi .bashrc 
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ vim .bash_profile 
-bash: vim: command not found
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ vim .bash_profile 
-bash: vim: command not found
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ vi .bash_profile 
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ bash -i
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ exit
exit
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ sudo chmod 1777 /mnt^C
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ bash EL_AI_test_1.5.sh 
+ podman login registry.redhat.io
Authenticating with existing credentials for registry.redhat.io
Existing credentials are valid. Already logged in to registry.redhat.io
+ ilab --version
ilab, version 0.26.1
+ sudo cp /run/user/1000/containers/auth.json /etc/ostree/
+ mkdir iso-testrun
mkdir: cannot create directory ‘iso-testrun’: File exists
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ rm -rf iso-testrun/
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ bash EL_AI_test_1.5.sh ^C
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ ls -latr
total 20
-rw-r--r--. 1 cloud-user cloud-user  492 May 16 05:38 .bashrc
-rw-r--r--. 1 cloud-user cloud-user   18 May 16 05:38 .bash_logout
drwxr-xr-x. 3 root       root         24 May 19 12:56 ..
drwx------. 2 cloud-user cloud-user   29 May 19 12:56 .ssh
-rw-r--r--. 1 cloud-user cloud-user 3566 May 19 14:09 EL_AI_test_1.5.sh
drwx------. 3 cloud-user cloud-user   19 May 19 14:09 .local
drwxr-xr-x. 5 cloud-user cloud-user   54 May 19 14:11 .config
drwxr-xr-x. 3 cloud-user cloud-user   25 May 19 14:11 .cache
-rw-r--r--. 1 cloud-user cloud-user  168 May 19 14:22 .bash_profile
-rw-------. 1 cloud-user cloud-user   10 May 19 14:23 .bash_history
drwx------. 6 cloud-user cloud-user  163 May 19 14:24 .
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ rm -rf .cache/instructlab/
models/ oci/    
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ rm -rf .cache/
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ rm -rf .local/share/
containers/  instructlab/ 
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ rm -rf .local/share/
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ mkdir .cache
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ rmdir .ca^C
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ ls -latrZ
total 20
-rw-r--r--. 1 cloud-user cloud-user unconfined_u:object_r:user_home_t:s0      492 May 16 05:38 .bashrc
-rw-r--r--. 1 cloud-user cloud-user unconfined_u:object_r:user_home_t:s0       18 May 16 05:38 .bash_logout
drwxr-xr-x. 3 root       root       system_u:object_r:home_root_t:s0           24 May 19 12:56 ..
drwx------. 2 cloud-user cloud-user system_u:object_r:ssh_home_t:s0            29 May 19 12:56 .ssh
-rw-r--r--. 1 cloud-user cloud-user unconfined_u:object_r:user_home_t:s0     3566 May 19 14:09 EL_AI_test_1.5.sh
drwxr-xr-x. 5 cloud-user cloud-user unconfined_u:object_r:config_home_t:s0     54 May 19 14:11 .config
-rw-r--r--. 1 cloud-user cloud-user unconfined_u:object_r:user_home_t:s0      168 May 19 14:22 .bash_profile
-rw-------. 1 cloud-user cloud-user unconfined_u:object_r:user_home_t:s0       10 May 19 14:23 .bash_history
drwx------. 2 cloud-user cloud-user unconfined_u:object_r:gconf_home_t:s0       6 May 19 14:24 .local
drwxr-xr-x. 2 cloud-user cloud-user unconfined_u:object_r:cache_home_t:s0       6 May 19 14:24 .cache
drwx------. 6 cloud-user cloud-user unconfined_u:object_r:user_home_dir_t:s0  163 May 19 14:24 .
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ sudo restorecon -F .cache
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ ls -latr
total 20
-rw-r--r--. 1 cloud-user cloud-user  492 May 16 05:38 .bashrc
-rw-r--r--. 1 cloud-user cloud-user   18 May 16 05:38 .bash_logout
drwxr-xr-x. 3 root       root         24 May 19 12:56 ..
drwx------. 2 cloud-user cloud-user   29 May 19 12:56 .ssh
-rw-r--r--. 1 cloud-user cloud-user 3566 May 19 14:09 EL_AI_test_1.5.sh
drwxr-xr-x. 5 cloud-user cloud-user   54 May 19 14:11 .config
-rw-r--r--. 1 cloud-user cloud-user  168 May 19 14:22 .bash_profile
-rw-------. 1 cloud-user cloud-user   10 May 19 14:23 .bash_history
drwx------. 2 cloud-user cloud-user    6 May 19 14:24 .local
drwxr-xr-x. 2 cloud-user cloud-user    6 May 19 14:24 .cache
drwx------. 6 cloud-user cloud-user  163 May 19 14:24 .
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ sudo ls -latrZ
total 20
-rw-r--r--. 1 cloud-user cloud-user unconfined_u:object_r:user_home_t:s0      492 May 16 05:38 .bashrc
-rw-r--r--. 1 cloud-user cloud-user unconfined_u:object_r:user_home_t:s0       18 May 16 05:38 .bash_logout
drwxr-xr-x. 3 root       root       system_u:object_r:home_root_t:s0           24 May 19 12:56 ..
drwx------. 2 cloud-user cloud-user system_u:object_r:ssh_home_t:s0            29 May 19 12:56 .ssh
-rw-r--r--. 1 cloud-user cloud-user unconfined_u:object_r:user_home_t:s0     3566 May 19 14:09 EL_AI_test_1.5.sh
drwxr-xr-x. 5 cloud-user cloud-user unconfined_u:object_r:config_home_t:s0     54 May 19 14:11 .config
-rw-r--r--. 1 cloud-user cloud-user unconfined_u:object_r:user_home_t:s0      168 May 19 14:22 .bash_profile
-rw-------. 1 cloud-user cloud-user unconfined_u:object_r:user_home_t:s0       10 May 19 14:23 .bash_history
drwx------. 2 cloud-user cloud-user unconfined_u:object_r:gconf_home_t:s0       6 May 19 14:24 .local
drwxr-xr-x. 2 cloud-user cloud-user unconfined_u:object_r:cache_home_t:s0       6 May 19 14:24 .cache
drwx------. 6 cloud-user cloud-user unconfined_u:object_r:user_home_dir_t:s0  163 May 19 14:24 .
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ bash EL_AI_test_1.5.sh 
+ podman login registry.redhat.io
Authenticating with existing credentials for registry.redhat.io
Existing credentials are valid. Already logged in to registry.redhat.io
+ ilab --version
ilab, version 0.26.1
+ sudo cp /run/user/1000/containers/auth.json /etc/ostree/
+ mkdir iso-testrun
+ ilab config init
Existing config file was found in:
    /var/home/cloud-user/.config/instructlab/config.yaml
Do you still want to continue? [y/N]: ^CAborted!
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ rm .config/
cni/         containers/  instructlab/ 
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ rm .config/ins^C
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ sudo rm -rf .config/instructlab/
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ sudo rm -rf .config/c
cni/        containers/ 
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ sudo rm -rf .config/cni/net.d/cni.lock 
.bash_history      .bash_profile      .cache/            .local/            EL_AI_test_1.5.sh  
.bash_logout       .bashrc            .config/           .ssh/              iso-testrun/       
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ sudo rm -rf .config/cni/net.d/cni.lock 
.bash_history      .bash_profile      .cache/            .local/            EL_AI_test_1.5.sh  
.bash_logout       .bashrc            .config/           .ssh/              iso-testrun/       
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ sudo rm -rf .config/containers/
auth.json     storage.conf  
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ sudo rm -rf .config/containers/
auth.json     storage.conf  
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ sudo rm -rf .config/containers/storage.conf ^C
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ /etc/skel/.config/containers/storage.conf /mnt/.config/containers/storage.conf
-bash: /etc/skel/.config/containers/storage.conf: Permission denied
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ sudo ^Ctc/skel/.config/containers/storage.conf /mnt/.config/containers/storage.conf
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ sudo chmod o^C
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ umask
0022
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ ls -latr /etc/skel/
total 24
drwxr-xr-x.  3 root root   24 May 16 05:38 .config
-rw-r--r--.  1 root root  492 May 16 05:38 .bashrc
-rw-r--r--.  1 root root  141 May 16 05:38 .bash_profile
-rw-r--r--.  1 root root   18 May 16 05:38 .bash_logout
drwxr-xr-x.  3 root root   77 May 16 05:38 .
drwxr-xr-x. 90 root root 8192 May 19 12:56 ..
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ ls -latr /etc/skel/.config/
total 0
drwxr-xr-x. 2 root root 26 May 16 05:38 containers
drwxr-xr-x. 3 root root 77 May 16 05:38 ..
drwxr-xr-x. 3 root root 24 May 16 05:38 .
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ ls -latr /etc/skel/.config/containers/
total 4
-rw-r--r--. 1 root root 330 May 16 05:38 storage.conf
drwxr-xr-x. 3 root root  24 May 16 05:38 ..
drwxr-xr-x. 2 root root  26 May 16 05:38 .
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ sudo chown cloud-user /mnt/.^C
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ sudo ls -latr /mnt.config
ls: cannot access '/mnt.config': No such file or directory
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ sudo ls -latr /mnt/
total 4
drwxr-xr-x. 24 root       root       4096 May 19 12:56 ..
drwxrwxrwt.  3 root       root         21 May 19 14:21 .
drwxr-xr-x.  3 cloud-user cloud-user   24 May 19 14:21 .config
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ sudo ls -latr /mnt/.config/
total 0
drwxrwxrwt. 3 root       root       21 May 19 14:21 ..
drwxr-xr-x. 3 cloud-user cloud-user 24 May 19 14:21 .
drwxr-xr-x. 2 cloud-user cloud-user 26 May 19 14:21 containers
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ sudo ls -latr /mnt/.config/containers/
total 4
drwxr-xr-x. 3 cloud-user cloud-user  24 May 19 14:21 ..
-rw-r--r--. 1 cloud-user cloud-user 330 May 19 14:21 storage.conf
drwxr-xr-x. 2 cloud-user cloud-user  26 May 19 14:21 .
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ sudo cat /mnt/.config/containers/
cat: /mnt/.config/containers/: Is a directory
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ sudo cat /mnt/.config/containers/storage.conf 
[storage]
driver = "overlay"
[storage.options]
size = ""
remap-uids = ""
remap-gids = ""
ignore_chown_errors = ""
remap-user = ""
remap-group = ""
skip_mount_home = ""
mount_program = "/usr/bin/fuse-overlayfs"
mountopt = ""
additionalimagestores = [ "/usr/lib/containers/storage",]
[storage.options.overlay]
force_mask = "shared"
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ diff /etc/skel/.config/containers/storage.conf /mnt/.config/containers/storage.conf
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ sudo rm -rf .config/instructlab/^C
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ bash EL_AI_test_1.5.sh 
+ podman login registry.redhat.io
Authenticating with existing credentials for registry.redhat.io
Existing credentials are valid. Already logged in to registry.redhat.io
+ ilab --version
ilab, version 0.26.1
+ sudo cp /run/user/1000/containers/auth.json /etc/ostree/
+ mkdir iso-testrun
mkdir: cannot create directory ‘iso-testrun’: File exists
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ rm -rf iso-testrun/
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ bash EL_AI_test_1.5.sh 
+ podman login registry.redhat.io
Authenticating with existing credentials for registry.redhat.io
Existing credentials are valid. Already logged in to registry.redhat.io
+ ilab --version
ilab, version 0.26.1
+ sudo cp /run/user/1000/containers/auth.json /etc/ostree/
+ mkdir iso-testrun
+ ilab config init

Existing system profiles were found in:
    /var/home/cloud-user/.local/share/instructlab/internal/system_profiles
Do you want to restore these profiles to the default values? [y/N]: ^CAborted!
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ sudo rm -rf^C
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ echo $ILAB_HOME

[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ exit
logout
Connection to 150.240.3.148 closed.
mdepaulo@mdepaulo-thinkpadx1nanogen2:~/rhelai/ecosystem-rhel-ai$ ssh cloud-user@150.240.3.148
Last login: Mon May 19 14:23:42 2025 from 98.116.66.226
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ env | grep HOME
HOME=/var/home/cloud-user
ILAB_HOME=/mnt
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ ls -latr
total 20
-rw-r--r--. 1 cloud-user cloud-user  492 May 16 05:38 .bashrc
-rw-r--r--. 1 cloud-user cloud-user   18 May 16 05:38 .bash_logout
drwxr-xr-x. 3 root       root         24 May 19 12:56 ..
drwx------. 2 cloud-user cloud-user   29 May 19 12:56 .ssh
-rw-r--r--. 1 cloud-user cloud-user 3566 May 19 14:09 EL_AI_test_1.5.sh
-rw-r--r--. 1 cloud-user cloud-user  168 May 19 14:22 .bash_profile
drwx------. 3 cloud-user cloud-user   19 May 19 14:25 .local
drwxr-xr-x. 3 cloud-user cloud-user   25 May 19 14:25 .cache
drwxr-xr-x. 2 cloud-user cloud-user    6 May 19 14:27 iso-testrun
drwx------. 7 cloud-user cloud-user  182 May 19 14:27 .
drwxr-xr-x. 5 cloud-user cloud-user   54 May 19 14:27 .config
-rw-------. 1 cloud-user cloud-user 1660 May 19 14:28 .bash_history
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ rm -rf .config/
cni/         containers/  instructlab/ 
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ rm -rf .config/
cni/         containers/  instructlab/ 
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ rm -rf .config/
cni/         containers/  instructlab/ 
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ rm -rf .config .local/ .cache/
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ bash EL_AI_test_1.5.sh 
+ podman login registry.redhat.io
Authenticating with existing credentials for registry.redhat.io
Existing credentials are valid. Already logged in to registry.redhat.io
+ ilab --version
ilab, version 0.26.1
+ sudo cp /run/user/1000/containers/auth.json /etc/ostree/
+ mkdir iso-testrun
mkdir: cannot create directory ‘iso-testrun’: File exists
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ rm -rf iso-testrun/^C
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ rmdir iso-testrun/
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ bash EL_AI_test_1.5.sh 
+ podman login registry.redhat.io
Authenticating with existing credentials for registry.redhat.io
Existing credentials are valid. Already logged in to registry.redhat.io
+ ilab --version
ilab, version 0.26.1
+ sudo cp /run/user/1000/containers/auth.json /etc/ostree/
+ mkdir iso-testrun
+ ilab config init

----------------------------------------------------
         Welcome to the InstructLab CLI
  This guide will help you to setup your environment
----------------------------------------------------

Please provide the following values to initiate the environment [press 'Enter' for default options when prompted]
Cloning https://github.com/instructlab/taxonomy.git...

Generating config file:
    /mnt/.config/instructlab/config.yaml

Please choose a system profile.
Profiles set hardware-specific defaults for all commands and sections of the configuration.
First, please select the hardware vendor your system falls into
[0] NO SYSTEM PROFILE
[1] AMD
Enter the number of your choice [0]: 1
You selected: AMD
Next, please select the specific hardware configuration that most closely matches your system.
[0] NO SYSTEM PROFILE
[1] AMD MI300X X4
[2] AMD MI300X X2
[3] AMD MI300X X8
Enter the number of your choice [hit enter for hardware defaults] [0]: 3
You selected: /mnt/.local/share/instructlab/internal/system_profiles/amd/mi300x/mi300x_x8.yaml

--------------------------------------------
    Initialization completed successfully!
  You're ready to start using `ilab`. Enjoy!
--------------------------------------------
+ sed -i 's/gpus: 1/gpus: 8/g' /var/home/cloud-user/.config/instructlab/config.yaml
sed: can't read /var/home/cloud-user/.config/instructlab/config.yaml: No such file or directory
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ cp EL_AI_test_1.5.sh EL_AI_test_1.5.rem^C
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ vim EL_AI_test_1.5.sh 
-bash: vim: command not found
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ vi EL_AI_test_1.5.sh 
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ bash EL_AI_test_1.5.sh 
+ podman login registry.redhat.io
Authenticating with existing credentials for registry.redhat.io
Existing credentials are valid. Already logged in to registry.redhat.io
+ sed -i 's/gpus: 1/gpus: 8/g' /mnt/.config/instructlab/config.yaml
+ ilab config show
+ ilab system info
+ ilab model download --repository docker://registry.stage.redhat.io/rhelai1/skills-adapter-v3 --release 1.5
INFO 2025-05-19 14:36:09,612 instructlab.model.download:192: Downloading model from OCI registry:
    Model: docker://registry.stage.redhat.io/rhelai1/skills-adapter-v3@1.5
    Destination: /mnt/.cache/instructlab/models
Copying blob 6f4761a5ce47 done   | 
Copying blob cd99f66c98e5 done   | 
Copying blob cfc7749b96f6 done   | 
Copying blob 4452b845ab9c done   | 
Copying blob 488e082ff0d1 done   | 
Copying blob 01f47425d010 done   | 
Copying blob d8d4489231c6 done   | 
Copying blob 5d44fdf2d36d done   | 
Copying config 44136fa355 done   | 
Writing manifest to image destination
INFO 2025-05-19 14:36:17,098 instructlab.model.download:288: 
ᕦ(òᴗóˇ)ᕤ docker://registry.stage.redhat.io/rhelai1/skills-adapter-v3 model download completed successfully! ᕦ(òᴗóˇ)ᕤ

INFO 2025-05-19 14:36:17,098 instructlab.model.download:302: Available models (`ilab model list`):
+------------+---------------+------+---------------+
| Model Name | Last Modified | Size | Absolute path |
+------------+---------------+------+---------------+
+------------+---------------+------+---------------+
+ ilab model download --repository docker://registry.stage.redhat.io/rhelai1/knowledge-adapter-v3 --release 1.5
INFO 2025-05-19 14:36:23,501 instructlab.model.download:192: Downloading model from OCI registry:
    Model: docker://registry.stage.redhat.io/rhelai1/knowledge-adapter-v3@1.5
    Destination: /mnt/.cache/instructlab/models
Copying blob c4334cbcdf17 done   | 
Copying blob 4d0d6bb4d9d0 done   | 
Copying blob 488e082ff0d1 done   | 
Copying blob e84e60569620 done   | 
Copying blob 82d96d7a9e6c done   | 
Copying blob cfc7749b96f6 done   | 
Copying blob 0f17dc4a3b97 done   | 
Copying blob d2313c03a149 done   | 
Copying config 44136fa355 done   | 
Writing manifest to image destination
INFO 2025-05-19 14:36:28,854 instructlab.model.download:288: 
ᕦ(òᴗóˇ)ᕤ docker://registry.stage.redhat.io/rhelai1/knowledge-adapter-v3 model download completed successfully! ᕦ(òᴗóˇ)ᕤ

INFO 2025-05-19 14:36:28,854 instructlab.model.download:302: Available models (`ilab model list`):
+------------+---------------+------+---------------+
| Model Name | Last Modified | Size | Absolute path |
+------------+---------------+------+---------------+
+------------+---------------+------+---------------+
+ ilab model download --repository docker://registry.stage.redhat.io/rhelai1/granite-3.1-8b-lab-v2 --release 1.5
INFO 2025-05-19 14:36:35,380 instructlab.model.download:192: Downloading model from OCI registry:
    Model: docker://registry.stage.redhat.io/rhelai1/granite-3.1-8b-lab-v2@1.5
    Destination: /mnt/.cache/instructlab/models
Copying blob cfeca4972fa9 done   | 
Copying blob 694fca9fdcbf done   | 
Copying blob 081bdeaf76fc done   | 
Copying blob db47a10e7df0 done   | 
Copying blob 3f4905316ed0 done   | 
Copying blob 303127a244b0 done   | 
Copying blob 6ff9f3935185 done   | 
Copying blob ac7915d4e17b done   | 
Copying blob 05db54e4d322 done   | 
Copying blob ed944c0b3d71 done   | 
Copying blob 66a07f75fd8d done   | 
Copying blob 80ab859339a2 done   | 
Copying config 44136fa355 done   | 
Writing manifest to image destination
INFO 2025-05-19 14:39:06,562 instructlab.model.download:288: 
ᕦ(òᴗóˇ)ᕤ docker://registry.stage.redhat.io/rhelai1/granite-3.1-8b-lab-v2 model download completed successfully! ᕦ(òᴗóˇ)ᕤ

INFO 2025-05-19 14:39:06,562 instructlab.model.download:302: Available models (`ilab model list`):
+------------------------------+---------------------+---------+------------------------------------------------------+
| Model Name                   | Last Modified       | Size    | Absolute path                                        |
+------------------------------+---------------------+---------+------------------------------------------------------+
| models/granite-3.1-8b-lab-v2 | 2025-05-19 14:39:06 | 15.6 GB | /mnt/.cache/instructlab/models/granite-3.1-8b-lab-v2 |
+------------------------------+---------------------+---------+------------------------------------------------------+
+ ilab model download --repository docker://registry.stage.redhat.io/rhelai1/granite-3.1-8b-starter-v2 --release 1.5
INFO 2025-05-19 14:39:12,995 instructlab.model.download:192: Downloading model from OCI registry:
    Model: docker://registry.stage.redhat.io/rhelai1/granite-3.1-8b-starter-v2@1.5
    Destination: /mnt/.cache/instructlab/models
Copying blob 8dc79e9b964f done   | 
Copying blob 3b555dd64b66 done   | 
Copying blob 303127a244b0 done   | 
Copying blob 6262ed507463 done   | 
Copying blob 55febb012f75 done   | 
Copying blob 081bdeaf76fc done   | 
Copying blob 2548cc91efbb done   | 
Copying blob 3119969eb015 done   | 
Copying blob ac7915d4e17b done   | 
Copying blob 05db54e4d322 done   | 
Copying blob ed944c0b3d71 done   | 
Copying blob 66a07f75fd8d done   | 
Copying blob 3e1391c11dea done   | 
Copying blob 80ab859339a2 done   | 
Copying config 44136fa355 done   | 
Writing manifest to image destination
INFO 2025-05-19 14:41:46,932 instructlab.model.download:288: 
ᕦ(òᴗóˇ)ᕤ docker://registry.stage.redhat.io/rhelai1/granite-3.1-8b-starter-v2 model download completed successfully! ᕦ(òᴗóˇ)ᕤ

INFO 2025-05-19 14:41:46,932 instructlab.model.download:302: Available models (`ilab model list`):
+----------------------------------+---------------------+---------+----------------------------------------------------------+
| Model Name                       | Last Modified       | Size    | Absolute path                                            |
+----------------------------------+---------------------+---------+----------------------------------------------------------+
| models/granite-3.1-8b-lab-v2     | 2025-05-19 14:39:06 | 15.6 GB | /mnt/.cache/instructlab/models/granite-3.1-8b-lab-v2     |
| models/granite-3.1-8b-starter-v2 | 2025-05-19 14:41:46 | 15.6 GB | /mnt/.cache/instructlab/models/granite-3.1-8b-starter-v2 |
+----------------------------------+---------------------+---------+----------------------------------------------------------+
+ ilab model download --repository docker://registry.stage.redhat.io/rhelai1/mixtral-8x7b-instruct-v0-1 --release 1.5
INFO 2025-05-19 14:41:53,493 instructlab.model.download:192: Downloading model from OCI registry:
    Model: docker://registry.stage.redhat.io/rhelai1/mixtral-8x7b-instruct-v0-1@1.5
    Destination: /mnt/.cache/instructlab/models
Copying blob d0b63fca793c done   | 
Copying blob 29e15364d8ab done   | 
Copying blob 47324f06fdb5 done   | 
Copying blob 40e6ecbcedfc done   | 
Copying blob 9d56d04b36d0 done   | 
Copying blob 54669c5aec29 done   | 
Copying blob 67e0596920fe done   | 
Copying blob e330eabd70b4 done   | 
Copying blob 048fa5347877 done   | 
Copying blob 83bfed6169c1 done   | 
Copying blob af316ad78402 done   | 
Copying blob 5882e4366c63 done   | 
Copying blob 77813d1dbee6 done   | 
Copying blob ff24540d9967 done   | 
Copying blob 48bc12845676 done   | 
Copying blob e56a2e7eda69 done   | 
Copying blob da627f6a3c8f done   | 
Copying blob 61e0f22bff93 done   | 
Copying blob 76466bfc2312 done   | 
Copying blob 570af3b802be done   | 
Copying blob 4c603b65cbd5 done   | 
Copying blob 272f33c76bca done   | 
Copying blob a8f30ebfaf56 done   | 
Copying blob 6fa06efa2785 done   | 
Copying blob 11c08db21487 done   | 
Copying blob dadfd56d7667 done   | 
Copying blob 475361439e5c done   | 
Copying config 44136fa355 done   | 
Writing manifest to image destination
INFO 2025-05-19 14:51:53,163 instructlab.model.download:288: 
ᕦ(òᴗóˇ)ᕤ docker://registry.stage.redhat.io/rhelai1/mixtral-8x7b-instruct-v0-1 model download completed successfully! ᕦ(òᴗóˇ)ᕤ

INFO 2025-05-19 14:51:53,163 instructlab.model.download:302: Available models (`ilab model list`):
+-----------------------------------+---------------------+---------+-----------------------------------------------------------+
| Model Name                        | Last Modified       | Size    | Absolute path                                             |
+-----------------------------------+---------------------+---------+-----------------------------------------------------------+
| models/granite-3.1-8b-lab-v2      | 2025-05-19 14:39:06 | 15.6 GB | /mnt/.cache/instructlab/models/granite-3.1-8b-lab-v2      |
| models/granite-3.1-8b-starter-v2  | 2025-05-19 14:41:46 | 15.6 GB | /mnt/.cache/instructlab/models/granite-3.1-8b-starter-v2  |
| models/mixtral-8x7b-instruct-v0-1 | 2025-05-19 14:51:53 | 87.0 GB | /mnt/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1 |
+-----------------------------------+---------------------+---------+-----------------------------------------------------------+
+ ilab model download --repository docker://registry.stage.redhat.io/rhelai1/prometheus-8x7b-v2-0 --release 1.5
INFO 2025-05-19 14:51:59,851 instructlab.model.download:192: Downloading model from OCI registry:
    Model: docker://registry.stage.redhat.io/rhelai1/prometheus-8x7b-v2-0@1.5
    Destination: /mnt/.cache/instructlab/models
Copying blob cc0b434114a0 done   | 
Copying blob 40e6ecbcedfc done   | 
Copying blob 45147a3fae61 done   | 
Copying blob a375e93d6f89 done   | 
Copying blob 17e420ee7a3c done   | 
Copying blob 9d56d04b36d0 done   | 
Copying blob 07529e846183 done   | 
Copying blob 69239081714b done   | 
Copying blob 82ba1df1bcff done   | 
Copying blob 7dfbb89db40a done   | 
Copying blob d6b91c38dcac done   | 
Copying blob 042fa6758c75 done   | 
Copying blob fc2658c9dba2 done   | 
Copying blob 958bf1eb6fc6 done   | 
Copying blob 4cfc38eabca1 done   | 
Copying blob d89723805505 done   | 
Copying blob ad148e16985f done   | 
Copying blob 520bd83ae1b8 done   | 
Copying blob 189922a4c16e done   | 
Copying blob 96b05ad26199 done   | 
Copying blob e6086166348b done   | 
Copying blob af6f32190c41 done   | 
Copying blob 92470b0bd930 done   | 
Copying blob a8f30ebfaf56 done   | 
Copying blob 96bdbb8504d9 done   | 
Copying blob fc4f0bd70b37 done   | 
Copying blob dadfd56d7667 done   | 
Copying blob 7ada2fa1461c done   | 
Copying config 44136fa355 done   | 
Writing manifest to image destination
INFO 2025-05-19 15:02:12,748 instructlab.model.download:288: 
ᕦ(òᴗóˇ)ᕤ docker://registry.stage.redhat.io/rhelai1/prometheus-8x7b-v2-0 model download completed successfully! ᕦ(òᴗóˇ)ᕤ

INFO 2025-05-19 15:02:12,748 instructlab.model.download:302: Available models (`ilab model list`):
+-----------------------------------+---------------------+---------+-----------------------------------------------------------+
| Model Name                        | Last Modified       | Size    | Absolute path                                             |
+-----------------------------------+---------------------+---------+-----------------------------------------------------------+
| models/granite-3.1-8b-lab-v2      | 2025-05-19 14:39:06 | 15.6 GB | /mnt/.cache/instructlab/models/granite-3.1-8b-lab-v2      |
| models/granite-3.1-8b-starter-v2  | 2025-05-19 14:41:46 | 15.6 GB | /mnt/.cache/instructlab/models/granite-3.1-8b-starter-v2  |
| models/mixtral-8x7b-instruct-v0-1 | 2025-05-19 14:51:53 | 87.0 GB | /mnt/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1 |
| models/prometheus-8x7b-v2-0       | 2025-05-19 15:02:12 | 87.0 GB | /mnt/.cache/instructlab/models/prometheus-8x7b-v2-0       |
+-----------------------------------+---------------------+---------+-----------------------------------------------------------+
+ ilab taxonomy diff
compositional_skills/grounded/linguistics/inclusion/qna.yaml
compositional_skills/grounded/linguistics/writing/rewriting/qna.yaml
compositional_skills/linguistics/synonyms/qna.yaml
knowledge/arts/music/fandom/swifties/qna.yaml
knowledge/science/animals/birds/black_capped_chickadee/qna.yaml
Taxonomy in /mnt/.local/share/instructlab/taxonomy is valid :)
+ ilab model serve
INFO 2025-05-19 15:02:29,233 instructlab.model.serve_backend:80: Setting backend_type in the serve config to vllm
INFO 2025-05-19 15:02:29,249 instructlab.model.serve_backend:86: Using model '/mnt/.cache/instructlab/models/granite-3.1-8b-lab-v2' with -1 gpu-layers and 4096 max context size.
INFO 2025-05-19 15:04:42,074 instructlab.model.serve_backend:133: '--gpus' flag used alongside '--tensor-parallel-size' in the vllm_args section of the config file. Using value of the --gpus flag.
INFO 2025-05-19 15:04:42,343 instructlab.model.backends.vllm:332: vLLM starting up on pid 6 at http://127.0.0.1:8000/v1
INFO 05-19 15:05:05 [__init__.py:239] Automatically detected platform rocm.
INFO 05-19 15:05:07 [api_server.py:1034] vLLM API server version 0.8.4
INFO 05-19 15:05:07 [api_server.py:1035] args: Namespace(host='127.0.0.1', port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template='/tmp/tmp23qjiqo4', chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/mnt/.cache/instructlab/models/granite-3.1-8b-lab-v2', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config=None, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='auto', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend='mp', pipeline_parallel_size=1, tensor_parallel_size=8, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['/mnt/.cache/instructlab/models/granite-3.1-8b-lab-v2', 'granite-3.1-8b-lab-v2', 'models/granite-3.1-8b-lab-v2', 'models/granite-3.1-8b-starter-v2', 'models/mixtral-8x7b-instruct-v0-1', 'models/prometheus-8x7b-v2-0'], qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_chunked_mm_input=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False)
INFO 05-19 15:07:35 [config.py:689] This model supports multiple tasks: {'generate', 'embed', 'score', 'reward', 'classify'}. Defaulting to 'generate'.
INFO 05-19 15:07:35 [arg_utils.py:1742] rocm is experimental on VLLM_USE_V1=1. Falling back to V0 Engine.
WARNING 05-19 15:07:35 [arg_utils.py:1603] The model has a long context length (131072). This may causeOOM during the initial memory profiling phase, or result in low performance due to small KV cache size. Consider setting --max-model-len to a smaller value.
INFO 05-19 15:09:55 [api_server.py:246] Started engine process with PID 59
INFO 05-19 15:09:59 [__init__.py:239] Automatically detected platform rocm.
INFO 05-19 15:10:00 [llm_engine.py:243] Initializing a V0 LLM engine (v0.8.4) with config: model='/mnt/.cache/instructlab/models/granite-3.1-8b-lab-v2', speculative_config=None, tokenizer='/mnt/.cache/instructlab/models/granite-3.1-8b-lab-v2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/mnt/.cache/instructlab/models/granite-3.1-8b-lab-v2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True, 
WARNING 05-19 15:10:00 [multiproc_worker_utils.py:306] Reducing Torch parallelism from 104 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 05-19 15:10:04 [__init__.py:239] Automatically detected platform rocm.
INFO 05-19 15:10:04 [__init__.py:239] Automatically detected platform rocm.
INFO 05-19 15:10:04 [__init__.py:239] Automatically detected platform rocm.
INFO 05-19 15:10:04 [__init__.py:239] Automatically detected platform rocm.
INFO 05-19 15:10:04 [__init__.py:239] Automatically detected platform rocm.
INFO 05-19 15:10:04 [__init__.py:239] Automatically detected platform rocm.
INFO 05-19 15:10:04 [__init__.py:239] Automatically detected platform rocm.
(VllmWorkerProcess pid=82) INFO 05-19 15:10:06 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=85) INFO 05-19 15:10:06 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=84) INFO 05-19 15:10:06 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=87) INFO 05-19 15:10:06 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=83) INFO 05-19 15:10:06 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=86) INFO 05-19 15:10:06 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=81) INFO 05-19 15:10:06 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
^CINFO 2025-05-19 15:13:19,404 instructlab.model.backends.vllm:85: vLLM server terminated by keyboard
Traceback (most recent call last):
  File "/usr/lib64/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/opt/app-root/lib64/python3.11/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 1069, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/usr/lib64/python3.11/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib64/python3.11/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 264, in build_async_engine_client_from_engine_args
    await mq_engine_client.setup()
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/client.py", line 284, in setup
    response = await self._wait_for_server_rpc(socket)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/client.py", line 392, in _wait_for_server_rpc
    return await self._send_get_data_rpc_request(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/client.py", line 320, in _send_get_data_rpc_request
    if await socket.poll(timeout=VLLM_RPC_TIMEOUT) == 0:
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 1121, in <module>
    uvloop.run(run_server(args))
  File "/opt/app-root/lib64/python3.11/site-packages/uvloop/__init__.py", line 105, in run
    return runner.run(wrapper())
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/asyncio/runners.py", line 123, in run
    raise KeyboardInterrupt()
KeyboardInterrupt
INFO 2025-05-19 15:13:20,776 instructlab.model.backends.vllm:512: Waiting for GPU VRAM reclamation...
+ ilab model chat
INFO 2025-05-19 15:14:04,611 instructlab.model.backends.vllm:115: Trying to connect to model server at http://127.0.0.1:8000/v1
INFO 2025-05-19 15:14:06,222 instructlab.model.backends.vllm:332: vLLM starting up on pid 5 at http://127.0.0.1:54991/v1
INFO 2025-05-19 15:14:06,222 instructlab.model.backends.vllm:123: Starting a temporary vLLM server at http://127.0.0.1:54991/v1
INFO 2025-05-19 15:14:06,222 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 1/120
INFO 2025-05-19 15:14:09,517 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 2/120
INFO 2025-05-19 15:14:12,847 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 3/120
INFO 2025-05-19 15:14:16,168 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 4/120
INFO 2025-05-19 15:14:19,399 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 5/120
INFO 2025-05-19 15:14:22,694 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 6/120
INFO 2025-05-19 15:14:26,002 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 7/120
INFO 2025-05-19 15:14:29,375 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 8/120
INFO 2025-05-19 15:14:32,813 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 9/120
INFO 2025-05-19 15:14:36,067 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 10/120
INFO 2025-05-19 15:14:39,372 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 11/120
INFO 2025-05-19 15:14:42,696 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 12/120
INFO 2025-05-19 15:14:46,036 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 13/120
INFO 2025-05-19 15:14:49,246 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 14/120
INFO 2025-05-19 15:14:52,632 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 15/120
INFO 2025-05-19 15:14:55,873 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 16/120
INFO 2025-05-19 15:14:59,188 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 17/120
INFO 2025-05-19 15:15:02,605 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 18/120
INFO 2025-05-19 15:15:05,980 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 19/120
INFO 2025-05-19 15:15:09,163 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 20/120
INFO 2025-05-19 15:15:12,452 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 21/120
INFO 2025-05-19 15:15:15,836 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 22/120
INFO 2025-05-19 15:15:19,285 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 23/120
INFO 2025-05-19 15:15:22,500 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 24/120
INFO 2025-05-19 15:15:25,848 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 25/120
INFO 2025-05-19 15:15:29,290 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 26/120
INFO 2025-05-19 15:15:32,498 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 27/120
INFO 2025-05-19 15:15:35,819 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 28/120
INFO 2025-05-19 15:15:39,119 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 29/120
INFO 2025-05-19 15:15:42,324 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 30/120
INFO 2025-05-19 15:15:45,567 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 31/120
INFO 2025-05-19 15:15:48,867 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 32/120
INFO 2025-05-19 15:15:52,286 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 33/120
INFO 2025-05-19 15:15:55,744 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 34/120
INFO 2025-05-19 15:15:58,983 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 35/120
INFO 2025-05-19 15:16:02,157 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 36/120
INFO 2025-05-19 15:16:05,506 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 37/120
INFO 2025-05-19 15:16:08,804 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 38/120
INFO 2025-05-19 15:16:12,195 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 39/120
INFO 2025-05-19 15:16:15,489 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 40/120
INFO 2025-05-19 15:16:18,719 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 41/120
INFO 2025-05-19 15:16:21,893 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 42/120
INFO 2025-05-19 15:16:25,222 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 43/120
INFO 2025-05-19 15:16:28,586 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 44/120
INFO 2025-05-19 15:16:31,920 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 45/120
INFO 2025-05-19 15:16:35,245 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 46/120
INFO 2025-05-19 15:16:38,490 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 47/120
INFO 2025-05-19 15:16:41,964 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 48/120
INFO 2025-05-19 15:16:45,371 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 49/120
INFO 2025-05-19 15:16:48,699 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 50/120
INFO 2025-05-19 15:16:51,940 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 51/120
INFO 2025-05-19 15:16:55,266 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 52/120
INFO 2025-05-19 15:16:58,518 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 53/120
INFO 2025-05-19 15:17:01,903 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 54/120
INFO 2025-05-19 15:17:05,316 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 55/120
INFO 2025-05-19 15:17:08,625 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 56/120
INFO 2025-05-19 15:17:11,872 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 57/120
INFO 2025-05-19 15:17:15,222 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 58/120
INFO 2025-05-19 15:17:18,525 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 59/120
INFO 2025-05-19 15:17:21,930 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 60/120
INFO 2025-05-19 15:17:25,167 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 61/120
INFO 2025-05-19 15:17:28,493 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 62/120
INFO 2025-05-19 15:17:31,706 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 63/120
INFO 2025-05-19 15:17:35,054 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 64/120
INFO 2025-05-19 15:17:38,317 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 65/120
INFO 2025-05-19 15:17:41,594 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 66/120
INFO 2025-05-19 15:17:45,044 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 67/120
INFO 2025-05-19 15:17:48,344 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 68/120
INFO 2025-05-19 15:17:51,647 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 69/120
INFO 2025-05-19 15:17:54,889 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 70/120
INFO 2025-05-19 15:17:58,178 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 71/120
INFO 2025-05-19 15:18:01,446 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 72/120
INFO 2025-05-19 15:18:04,620 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 73/120
INFO 2025-05-19 15:18:07,853 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 74/120
INFO 2025-05-19 15:18:11,262 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 75/120
INFO 2025-05-19 15:18:14,445 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 76/120
INFO 2025-05-19 15:18:17,710 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 77/120
INFO 2025-05-19 15:18:21,052 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 78/120
INFO 2025-05-19 15:18:24,338 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 79/120
INFO 2025-05-19 15:18:27,632 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 80/120
INFO 2025-05-19 15:18:30,924 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 81/120
INFO 2025-05-19 15:18:34,209 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 82/120
INFO 2025-05-19 15:18:37,453 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 83/120
INFO 2025-05-19 15:18:40,689 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 84/120
INFO 2025-05-19 15:18:44,120 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 85/120
INFO 2025-05-19 15:18:47,559 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 86/120
INFO 2025-05-19 15:18:50,786 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 87/120
INFO 2025-05-19 15:18:54,033 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 88/120
INFO 2025-05-19 15:18:57,329 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 89/120
INFO 2025-05-19 15:19:00,654 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 90/120
INFO 2025-05-19 15:19:03,870 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 91/120
INFO 2025-05-19 15:19:07,315 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 92/120
INFO 2025-05-19 15:19:10,668 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 93/120
INFO 2025-05-19 15:19:14,096 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 94/120
INFO 2025-05-19 15:19:17,243 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 95/120
INFO 2025-05-19 15:19:20,492 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 96/120
INFO 2025-05-19 15:19:23,795 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 97/120
INFO 2025-05-19 15:19:27,006 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 98/120
INFO 2025-05-19 15:19:30,214 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 99/120
INFO 2025-05-19 15:19:33,614 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 100/120
INFO 2025-05-19 15:19:37,021 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 101/120
INFO 2025-05-19 15:19:40,376 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 102/120
INFO 2025-05-19 15:19:43,693 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 103/120
INFO 2025-05-19 15:19:47,065 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 104/120
INFO 2025-05-19 15:19:50,298 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 105/120
INFO 2025-05-19 15:19:53,620 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 106/120
INFO 2025-05-19 15:19:57,074 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 107/120
INFO 2025-05-19 15:20:00,427 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 108/120
INFO 2025-05-19 15:20:03,711 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 109/120
INFO 2025-05-19 15:20:07,163 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 110/120
INFO 2025-05-19 15:20:10,586 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 111/120
INFO 2025-05-19 15:20:13,932 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 112/120
INFO 2025-05-19 15:20:17,300 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 113/120
INFO 2025-05-19 15:20:20,653 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 114/120
INFO 2025-05-19 15:20:23,929 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 115/120
INFO 2025-05-19 15:20:27,198 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 116/120
INFO 2025-05-19 15:20:30,547 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 117/120
INFO 2025-05-19 15:20:33,802 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 118/120
INFO 2025-05-19 15:20:37,087 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 119/120
INFO 2025-05-19 15:20:40,373 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:54991/v1, this might take a moment... Attempt: 120/120
INFO 2025-05-19 15:20:41,685 instructlab.model.backends.vllm:148: Gave up waiting for vLLM server to start at http://127.0.0.1:54991/v1 after 120 attempts
INFO 2025-05-19 15:20:51,912 instructlab.model.backends.vllm:512: Waiting for GPU VRAM reclamation...
Traceback (most recent call last):
  File "/opt/app-root/bin/ilab", line 8, in <module>
    sys.exit(ilab())
             ^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1082, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1697, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1697, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 788, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/clickext.py", line 356, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/cli/model/chat.py", line 199, in chat
    chat_model(
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/chat.py", line 688, in chat_model
    api_base = backend_instance.run_detached(http_client(params))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/backends/vllm.py", line 179, in run_detached
    raise e
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/backends/vllm.py", line 169, in run_detached
    vllm_server_process, api_base = self._ensure_server(
                                    ^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/backends/vllm.py", line 156, in _ensure_server
    raise ServerException(f"vLLM failed to start up in {duration} seconds")
instructlab.model.backends.common.ServerException: vLLM failed to start up in 395.5 seconds
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ cat EL_AI_test_1.5.sh 
set -eux
#####
# podman login registry.stage.redhat.io # add credentials
podman login registry.redhat.io # add credentials
#ilab --version # to get a rhc connect command!
#sudo cp /run/user/1000/containers/auth.json /etc/ostree/ #to make bootc switch worky
################
#mkdir iso-testrun
#ilab config init
#sed -i '/--tensor-parallel-size/,+1d' $HOME/.config/instructlab/config.yaml
sed -i 's/gpus: 1/gpus: 8/g' $ILAB_HOME/.config/instructlab/config.yaml
ilab config show > iso-testrun/ilab-config-show
ilab system info > iso-testrun/ilab-system-info
### Pay attention to what models are to be used for testing the speciffic releases, this is valid for 1.4 !!!
### Also, pay attention to the .stage in the url - if you're doing prod testing, it'd be docker://registry.redhat.io
ilab model download --repository docker://registry.stage.redhat.io/rhelai1/skills-adapter-v3 --release 1.5
ilab model download --repository docker://registry.stage.redhat.io/rhelai1/knowledge-adapter-v3 --release 1.5
ilab model download --repository docker://registry.stage.redhat.io/rhelai1/granite-3.1-8b-lab-v2 --release 1.5
ilab model download --repository docker://registry.stage.redhat.io/rhelai1/granite-3.1-8b-starter-v2 --release 1.5
ilab model download --repository docker://registry.stage.redhat.io/rhelai1/mixtral-8x7b-instruct-v0-1 --release 1.5
ilab model download --repository docker://registry.stage.redhat.io/rhelai1/prometheus-8x7b-v2-0 --release 1.5
# END OF MODEL DOWNLOADS
ilab taxonomy diff
ilab model serve # Ctrl + C after gunicorn starts
ilab model chat
# tmux
time ilab data generate | tee iso-testrun/ilab-data-generate
# CTRL + B + D
# tail -f iso-testrun/ilab-data-generate # to watch progress and not stress about ssh connection drop
# ocassionally check output of nvidia-smi -l 3
### end of data generation
shuf -n 15000 .local/share/instructlab/datasets/`ls -1 .local/share/instructlab/datasets/ | head -n1`/skills_train_msgs_*.jsonl > .local/share/instructlab/datasets/`ls -1 .local/share/instructlab/datasets/ | head -n1`/skills_train_msgs_reduced.jsonl
# tmux a
time ilab model train -y --force-clear-phased-cache --enable-serving-output --strategy lab-multiphase --phased-phase1-data ~/.local/share/instructlab/datasets/`ls -1 ~/.local/share/instructlab/datasets/ | head -n1`/knowledge_train_msgs_*.jsonl --phased-phase2-data ~/.local/share/instructlab/datasets/`ls -1 .local/share/instructlab/datasets/ | head -n1`/skills_train_msgs_reduced.jsonl --phased-phase1-num-epochs 2 --phased-phase2-num-epochs 2 | tee iso-testrun/ilab-train
# CTRL + B + D
# tail -f iso-testrun/ilab-train # to watch progress and not stress about ssh connection drop
# ^^^ This is for a "short" training form which we currently use

## Do the following only if you're absolutely sure you should be doing the "long" testing!
#################### LOOOOONG ##########################
#tmux a
#time ilab model train -y --force-clear-phased-cache --enable-serving-output --strategy lab-multiphase --phased-phase1-data ~/.local/share/instructlab/datasets/`ls -1 ~/.local/share/instructlab/datasets/ | head -n1`/knowledge_train_msgs_*.jsonl --phased-phase2-data ~/.local/share/instructlab/datasets/`ls -1 .local/share/instructlab/datasets/ | head -n1`/skills_train_msgs_20*.jsonl --phased-phase1-num-epochs 2 --phased-phase2-num-epochs 2 | tee iso-testrun/ilab-train
## CTRL + B + D
#tail -f iso-testrun/ilab-train # to watch progress and not stress about ssh connection drop
#################### LOOOOONG ##########################

# And finally verify the chat works
ilab model chat

[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ nvtop
[cloud-user@mdepaulo-v157-amd-prod-2 ~]$ sudo nvtop