Type: Bug
Resolution: Done
Priority: Critical
Affects Versions: rhelai-1.4, rhelai-1.4.1
Severity: Critical
Approved
To Reproduce
Steps to reproduce the behavior:
- Run generate and train with the default settings, the 1.4 dataset, and this taxonomy: https://github.com/RedHatOfficial/rhelai-sample-taxonomy. More detailed steps can be found here: https://gitlab.com/redhat/rhel-ai/diip/-/blob/main/scripts/test_rhelai.sh?ref_type=heads
- Training phase 2 then fails with:
Training Phase 2/2...
TrainingArgs for current phase: TrainingArgs(model_path='/mnt/4TB/.local/share/instructlab/phased/phase1/checkpoints/hf_format/samples_1525', chat_tmpl_path='/opt/app-root/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py', use_legacy_tmpl=True, data_path='/mnt/4TB/.local/share/instructlab/datasets/2025-02-10_062843/skills_train_msgs_2025-02-10T06_30_07.jsonl', ckpt_output_dir='/mnt/4TB/.local/share/instructlab/phased/phase2/checkpoints', data_output_dir='/mnt/4TB/.local/share/instructlab/internal', max_seq_len=10000, max_batch_len=60000, num_epochs=1, effective_batch_size=3840, save_samples=0, learning_rate=6e-06, warmup_steps=25, random_seed=42, use_dolomite=True, is_padding_free=False, checkpoint_at_epoch=True, accelerate_full_state_at_epoch=True, mock_data=False, mock_data_len=0, deepspeed_options=DeepSpeedOptions(cpu_offload_optimizer=False, cpu_offload_optimizer_ratio=1.0, cpu_offload_optimizer_pin_memory=False, save_samples=None), fsdp_options=FSDPOptions(cpu_offload_params=False, sharding_strategy=<ShardingStrategies.SHARD_GRAD_OP: 'SHARD_GRAD_OP'>), distributed_backend=<DistributedBackend.FSDP: 'fsdp'>, disable_flash_attn=False, lora=LoraOptions(rank=0, alpha=32, dropout=0.1, target_modules=('q_proj', 'k_proj', 'v_proj', 'o_proj'), quantize_data_type=<QuantizeDataType.NONE: None>), process_data=True, keep_last_checkpoint_only=False)
data arguments are: {"data_path":"/mnt/4TB/.local/share/instructlab/datasets/2025-02-10_062843/skills_train_msgs_2025-02-10T06_30_07.jsonl","data_output_path":"/mnt/4TB/.local/share/instructlab/internal","max_seq_len":10000,"model_path":"/mnt/4TB/.local/share/instructlab/phased/phase1/checkpoints/hf_format/samples_1525","chat_tmpl_path":"/opt/app-root/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_legacy_tmpl.py","num_cpu_procs":16}
INFO 2025-02-10 07:05:56,084 root:879: Special tokens: eos: [0], pad: [49153], bos: [49152], system: [49154], user: [49155], assistant: [49156]
Generating train split: 344673 examples [00:02, 139674.49 examples/s]
--- Logging error ---
Traceback (most recent call last):
  File "/opt/app-root/lib64/python3.11/site-packages/datasets/builder.py", line 2013, in _prepare_split_single
    writer.write_table(table)
  File "/opt/app-root/lib64/python3.11/site-packages/datasets/arrow_writer.py", line 585, in write_table
    pa_table = table_cast(pa_table, self._schema)
  File "/opt/app-root/lib64/python3.11/site-packages/datasets/table.py", line 2281, in table_cast
    return cast_table_to_schema(table, schema)
  File "/opt/app-root/lib64/python3.11/site-packages/datasets/table.py", line 2240, in cast_table_to_schema
    arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
  File "/opt/app-root/lib64/python3.11/site-packages/datasets/table.py", line 2240, in <listcomp>
    arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
  File "/opt/app-root/lib64/python3.11/site-packages/datasets/table.py", line 1795, in wrapper
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "/opt/app-root/lib64/python3.11/site-packages/datasets/table.py", line 1795, in <listcomp>
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "/opt/app-root/lib64/python3.11/site-packages/datasets/table.py", line 2098, in cast_array_to_feature
    return array_cast(
  File "/opt/app-root/lib64/python3.11/site-packages/datasets/table.py", line 1797, in wrapper
    return func(array, *args, **kwargs)
  File "/opt/app-root/lib64/python3.11/site-packages/datasets/table.py", line 1948, in array_cast
    raise TypeError(f"Couldn't cast array of type {_short_str(array.type)} to {_short_str(pa_type)}")
TypeError: Couldn't cast array of type string to null

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/data_process.py", line 288, in main
    data = load_dataset("json", data_files=args.data_path, split="train")
  File "/opt/app-root/lib64/python3.11/site-packages/datasets/load.py", line 2628, in load_dataset
    builder_instance.download_and_prepare(
  File "/opt/app-root/lib64/python3.11/site-packages/datasets/builder.py", line 1029, in download_and_prepare
    self._download_and_prepare(
  File "/opt/app-root/lib64/python3.11/site-packages/datasets/builder.py", line 1124, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/opt/app-root/lib64/python3.11/site-packages/datasets/builder.py", line 1884, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/opt/app-root/lib64/python3.11/site-packages/datasets/builder.py", line 2040, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py", line 261, in _run_phase
    _training_phase(
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py", line 563, in _training_phase
    run_training(train_args=train_args, torch_args=torch_args)
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/__init__.py", line 36, in run_training
    return run_training(torch_args=torch_args, train_args=train_args)
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 672, in run_training
    dp.main(
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/data_process.py", line 291, in main
    raise Exception(
Exception: Malformed or missing data, please ensure that your dataset is not empty and correctly formatted

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.11/logging/__init__.py", line 1110, in emit
    msg = self.format(record)
  File "/usr/lib64/python3.11/logging/__init__.py", line 953, in format
    return fmt.format(record)
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/log.py", line 19, in format
    return super().format(record)
  File "/usr/lib64/python3.11/logging/__init__.py", line 687, in format
    record.message = record.getMessage()
  File "/usr/lib64/python3.11/logging/__init__.py", line 377, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "/opt/app-root/bin/ilab", line 8, in <module>
    sys.exit(ilab())
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1082, in main
    rv = self.invoke(ctx)
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1697, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1697, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 788, in invoke
    return __callback(*args, **kwargs)
  File "/opt/app-root/lib64/python3.11/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/clickext.py", line 356, in wrapper
    return f(*args, **kwargs)
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/cli/model/train.py", line 469, in train
    accelerated_train.accelerated_train(
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py", line 202, in accelerated_train
    _run_phased_training(
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py", line 432, in _run_phased_training
    _run_phase(
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py", line 276, in _run_phase
    logger.error("Failed during training loop: ", e)
Message: 'Failed during training loop: '
Arguments: (Exception('Malformed or missing data, please ensure that your dataset is not empty and correctly formatted'),)
Accelerated Training failed with 1
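The root failure in the log above is the Arrow cast error (TypeError: Couldn't cast array of type string to null) raised while data_process.py loads the skills JSONL with load_dataset("json", ...). A plausible mechanism, not confirmed from this log alone: some column in the generated file is null in every row of the first parse chunk, so Arrow freezes its type as "null", and a later chunk where the column holds strings can no longer be cast to that frozen schema. A minimal sketch under that assumption (the column name optional_col, the file path, and the small chunksize are illustrative only):

import json

from datasets import load_dataset

path = "/tmp/null_then_string.jsonl"
with open(path, "w") as f:
    # Rows where the optional column is null; if these fill the first
    # parse chunk, Arrow infers the column's type as "null".
    for _ in range(200):
        f.write(json.dumps({"messages": "ok", "optional_col": None}) + "\n")
    # A later row where the same column holds a string can then no longer
    # be cast to the already-frozen schema.
    f.write(json.dumps({"messages": "ok", "optional_col": "value"}) + "\n")

# A small chunksize forces multi-chunk parsing so the schema is frozen
# before the string value appears. Expected result: DatasetGenerationError
# caused by "TypeError: Couldn't cast array of type string to null".
load_dataset("json", data_files=path, split="train", chunksize=1024)

If that is what is happening here, the defect is in the generated dataset rather than in the training loop itself, and a fix would sit on the SDG side (avoid emitting all-null columns) or in data_process.py (load with explicit features instead of relying on schema inference).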
Example:
https://gitlab.com/redhat/rhel-ai/diip/-/jobs/9086908356
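Separately, the trailing "--- Logging error ---" section is a cosmetic bug in the error handler itself: the call stack shows accelerated_train.py calling logger.error("Failed during training loop: ", e). The stdlib logging module treats the extra positional argument as a %-format argument, and since the message contains no placeholder, formatting the record raises "not all arguments converted during string formatting" and obscures the real exception. A minimal sketch of the behavior and the usual fix:

import logging

logging.basicConfig(level=logging.ERROR)
logger = logging.getLogger("demo")

err = Exception("Malformed or missing data")

# Broken: "err" becomes a %-format argument, but the message has no %s,
# so logging prints "--- Logging error ---" plus
# "TypeError: not all arguments converted during string formatting".
logger.error("Failed during training loop: ", err)

# Fixed: give the argument a placeholder (or call logger.exception(...)
# inside an except block to include the traceback).
logger.error("Failed during training loop: %s", err)

This secondary bug does not cause the training failure; it only hides the underlying "Malformed or missing data" exception in the output.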
Expected behavior
- Phased training should complete successfully and produce a trained model.
Screenshots
- Attached Image
Device Info:
- Hardware Specs: reproduced on 8xL40S (AWS) and 8xA100 (IBM Cloud)
- OS Version: Red Hat Enterprise Linux 9.4 (Plow)
- InstructLab Version: 0.23.1 (see the ilab system info output below)
- Provide the output of these two commands:
- sudo bootc status --format json | jq .status.booted.image.image.image to print the name and tag of the bootc image; it should look like registry.stage.redhat.io/rhelai1/bootc-intel-rhel9:1.3-1732894187
- ilab system info to print detailed information about the InstructLab version, OS, and hardware, including GPU / AI accelerator hardware
[cloud-user@instructlab-ci-8xa100-preserve ~]$ sudo bootc status --format json | jq .status.booted.image.image.image
"registry.stage.redhat.io/rhelai1/bootc-nvidia-rhel9:1.4"
[cloud-user@instructlab-ci-8xa100-preserve cloud-user]$ ilab system info
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 CUDA devices:
Device 0: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Device 1: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Device 2: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Device 3: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Device 4: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Device 5: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Device 6: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Device 7: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Platform:
sys.version: 3.11.7 (main, Jan 8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
sys.platform: linux
os.name: posix
platform.release: 5.14.0-427.50.1.el9_4.x86_64
platform.machine: x86_64
platform.node: instructlab-ci-8xa100-preserve
platform.python_version: 3.11.7
os-release.ID: rhel
os-release.VERSION_ID: 9.4
os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
memory.total: 1259.87 GB
memory.available: 1248.68 GB
memory.used: 3.27 GB
InstructLab:
instructlab.version: 0.23.1
instructlab-dolomite.version: 0.2.0
instructlab-eval.version: 0.5.1
instructlab-quantize.version: 0.1.0
instructlab-schema.version: 0.4.2
instructlab-sdg.version: 0.7.0
instructlab-training.version: 0.7.0
Torch:
torch.version: 2.5.1
torch.backends.cpu.capability: AVX512
torch.version.cuda: 12.4
torch.version.hip: None
torch.cuda.available: True
torch.backends.cuda.is_built: True
torch.backends.mps.is_built: False
torch.backends.mps.is_available: False
torch.cuda.bf16: True
torch.cuda.current.device: 0
torch.cuda.0.name: NVIDIA A100-SXM4-80GB
torch.cuda.0.free: 78.7 GB
torch.cuda.0.total: 79.1 GB
torch.cuda.0.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.1.name: NVIDIA A100-SXM4-80GB
torch.cuda.1.free: 78.7 GB
torch.cuda.1.total: 79.1 GB
torch.cuda.1.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.2.name: NVIDIA A100-SXM4-80GB
torch.cuda.2.free: 78.7 GB
torch.cuda.2.total: 79.1 GB
torch.cuda.2.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.3.name: NVIDIA A100-SXM4-80GB
torch.cuda.3.free: 78.7 GB
torch.cuda.3.total: 79.1 GB
torch.cuda.3.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.4.name: NVIDIA A100-SXM4-80GB
torch.cuda.4.free: 78.7 GB
torch.cuda.4.total: 79.1 GB
torch.cuda.4.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.5.name: NVIDIA A100-SXM4-80GB
torch.cuda.5.free: 78.7 GB
torch.cuda.5.total: 79.1 GB
torch.cuda.5.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.6.name: NVIDIA A100-SXM4-80GB
torch.cuda.6.free: 78.7 GB
torch.cuda.6.total: 79.1 GB
torch.cuda.6.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.7.name: NVIDIA A100-SXM4-80GB
torch.cuda.7.free: 78.7 GB
torch.cuda.7.total: 79.1 GB
torch.cuda.7.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
llama_cpp_python:
llama_cpp_python.version: 0.3.2
llama_cpp_python.supports_gpu_offload: True
Bug impact
- InstructLab cannot be used to create a model: phased training fails in phase 2.
Known workaround
- None confirmed; possibly not using the mixin dataset?