Type: Bug
Resolution: Done
Priority: Critical
Affects Versions: rhelai-1.4, rhelai-1.4.1
Severity: Critical
Approved
To Reproduce
Steps to reproduce the behavior:
- Run generate and train with the default settings, the 1.4 dataset, and this taxonomy: https://github.com/RedHatOfficial/rhelai-sample-taxonomy. More detailed steps can be found here: https://gitlab.com/redhat/rhel-ai/diip/-/blob/main/scripts/test_rhelai.sh?ref_type=heads
- Training phase 2 then fails with:
Training Phase 2/2...
TrainingArgs for current phase: TrainingArgs(model_path='/mnt/4TB/.local/share/instructlab/phased/phase1/checkpoints/hf_format/samples_1525', chat_tmpl_path='/opt/app-root/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py', use_legacy_tmpl=True, data_path='/mnt/4TB/.local/share/instructlab/datasets/2025-02-10_062843/skills_train_msgs_2025-02-10T06_30_07.jsonl', ckpt_output_dir='/mnt/4TB/.local/share/instructlab/phased/phase2/checkpoints', data_output_dir='/mnt/4TB/.local/share/instructlab/internal', max_seq_len=10000, max_batch_len=60000, num_epochs=1, effective_batch_size=3840, save_samples=0, learning_rate=6e-06, warmup_steps=25, random_seed=42, use_dolomite=True, is_padding_free=False, checkpoint_at_epoch=True, accelerate_full_state_at_epoch=True, mock_data=False, mock_data_len=0, deepspeed_options=DeepSpeedOptions(cpu_offload_optimizer=False, cpu_offload_optimizer_ratio=1.0, cpu_offload_optimizer_pin_memory=False, save_samples=None), fsdp_options=FSDPOptions(cpu_offload_params=False, sharding_strategy=<ShardingStrategies.SHARD_GRAD_OP: 'SHARD_GRAD_OP'>), distributed_backend=<DistributedBackend.FSDP: 'fsdp'>, disable_flash_attn=False, lora=LoraOptions(rank=0, alpha=32, dropout=0.1, target_modules=('q_proj', 'k_proj', 'v_proj', 'o_proj'), quantize_data_type=<QuantizeDataType.NONE: None>), process_data=True, keep_last_checkpoint_only=False)
data arguments are: {"data_path":"/mnt/4TB/.local/share/instructlab/datasets/2025-02-10_062843/skills_train_msgs_2025-02-10T06_30_07.jsonl","data_output_path":"/mnt/4TB/.local/share/instructlab/internal","max_seq_len":10000,"model_path":"/mnt/4TB/.local/share/instructlab/phased/phase1/checkpoints/hf_format/samples_1525","chat_tmpl_path":"/opt/app-root/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_legacy_tmpl.py","num_cpu_procs":16}
INFO 2025-02-10 07:05:56,084 root:879: Special tokens: eos: [0], pad: [49153], bos: [49152], system: [49154], user: [49155], assistant: [49156]
Generating train split: 344673 examples [00:02, 139674.49 examples/s]
--- Logging error ---
Traceback (most recent call last):
  File "/opt/app-root/lib64/python3.11/site-packages/datasets/builder.py", line 2013, in _prepare_split_single
    writer.write_table(table)
  File "/opt/app-root/lib64/python3.11/site-packages/datasets/arrow_writer.py", line 585, in write_table
    pa_table = table_cast(pa_table, self._schema)
  File "/opt/app-root/lib64/python3.11/site-packages/datasets/table.py", line 2281, in table_cast
    return cast_table_to_schema(table, schema)
  File "/opt/app-root/lib64/python3.11/site-packages/datasets/table.py", line 2240, in cast_table_to_schema
    arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
  File "/opt/app-root/lib64/python3.11/site-packages/datasets/table.py", line 2240, in <listcomp>
    arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
  File "/opt/app-root/lib64/python3.11/site-packages/datasets/table.py", line 1795, in wrapper
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "/opt/app-root/lib64/python3.11/site-packages/datasets/table.py", line 1795, in <listcomp>
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "/opt/app-root/lib64/python3.11/site-packages/datasets/table.py", line 2098, in cast_array_to_feature
    return array_cast(
  File "/opt/app-root/lib64/python3.11/site-packages/datasets/table.py", line 1797, in wrapper
    return func(array, *args, **kwargs)
  File "/opt/app-root/lib64/python3.11/site-packages/datasets/table.py", line 1948, in array_cast
    raise TypeError(f"Couldn't cast array of type {_short_str(array.type)} to {_short_str(pa_type)}")
TypeError: Couldn't cast array of type string to null

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/data_process.py", line 288, in main
    data = load_dataset("json", data_files=args.data_path, split="train")
  File "/opt/app-root/lib64/python3.11/site-packages/datasets/load.py", line 2628, in load_dataset
    builder_instance.download_and_prepare(
  File "/opt/app-root/lib64/python3.11/site-packages/datasets/builder.py", line 1029, in download_and_prepare
    self._download_and_prepare(
  File "/opt/app-root/lib64/python3.11/site-packages/datasets/builder.py", line 1124, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/opt/app-root/lib64/python3.11/site-packages/datasets/builder.py", line 1884, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/opt/app-root/lib64/python3.11/site-packages/datasets/builder.py", line 2040, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py", line 261, in _run_phase
    _training_phase(
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py", line 563, in _training_phase
    run_training(train_args=train_args, torch_args=torch_args)
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/__init__.py", line 36, in run_training
    return run_training(torch_args=torch_args, train_args=train_args)
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 672, in run_training
    dp.main(
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/data_process.py", line 291, in main
    raise Exception(
Exception: Malformed or missing data, please ensure that your dataset is not empty and correctly formatted

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.11/logging/__init__.py", line 1110, in emit
    msg = self.format(record)
  File "/usr/lib64/python3.11/logging/__init__.py", line 953, in format
    return fmt.format(record)
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/log.py", line 19, in format
    return super().format(record)
  File "/usr/lib64/python3.11/logging/__init__.py", line 687, in format
    record.message = record.getMessage()
  File "/usr/lib64/python3.11/logging/__init__.py", line 377, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "/opt/app-root/bin/ilab", line 8, in <module>
    sys.exit(ilab())
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1082, in main
    rv = self.invoke(ctx)
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1697, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1697, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 788, in invoke
    return __callback(*args, **kwargs)
  File "/opt/app-root/lib64/python3.11/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/clickext.py", line 356, in wrapper
    return f(*args, **kwargs)
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/cli/model/train.py", line 469, in train
    accelerated_train.accelerated_train(
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py", line 202, in accelerated_train
    _run_phased_training(
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py", line 432, in _run_phased_training
    _run_phase(
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py", line 276, in _run_phase
    logger.error("Failed during training loop: ", e)
Message: 'Failed during training loop: '
Arguments: (Exception('Malformed or missing data, please ensure that your dataset is not empty and correctly formatted'),)
Accelerated Training failed with 1
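The root failure in the log above is the Arrow cast error (TypeError: Couldn't cast array of type string to null) raised while data_process.py loads the skills JSONL with load_dataset("json", ...). A plausible mechanism, not confirmed from this log alone: some column in the generated file is null in every row of the first parse chunk, so Arrow freezes its type as "null", and a later chunk where the column holds strings can no longer be cast to that frozen schema. A minimal sketch under that assumption (the column name optional_col, the file path, and the small chunksize are illustrative only):

import json

from datasets import load_dataset

path = "/tmp/null_then_string.jsonl"
with open(path, "w") as f:
    # Rows where the optional column is null; if these fill the first
    # parse chunk, Arrow infers the column's type as "null".
    for _ in range(200):
        f.write(json.dumps({"messages": "ok", "optional_col": None}) + "\n")
    # A later row where the same column holds a string can then no longer
    # be cast to the already-frozen schema.
    f.write(json.dumps({"messages": "ok", "optional_col": "value"}) + "\n")

# A small chunksize forces multi-chunk parsing so the schema is frozen
# before the string value appears. Expected result: DatasetGenerationError
# caused by "TypeError: Couldn't cast array of type string to null".
load_dataset("json", data_files=path, split="train", chunksize=1024)

If that is what is happening here, the defect is in the generated dataset rather than in the training loop itself, and a fix would sit on the SDG side (avoid emitting all-null columns) or in data_process.py (load with explicit features instead of relying on schema inference).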
Example:
https://gitlab.com/redhat/rhel-ai/diip/-/jobs/9086908356
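Separately, the trailing "--- Logging error ---" section is a cosmetic bug in the error handler itself: the call stack shows accelerated_train.py calling logger.error("Failed during training loop: ", e). The stdlib logging module treats the extra positional argument as a %-format argument, and since the message contains no placeholder, formatting the record raises "not all arguments converted during string formatting" and obscures the real exception. A minimal sketch of the behavior and the usual fix:

import logging

logging.basicConfig(level=logging.ERROR)
logger = logging.getLogger("demo")

err = Exception("Malformed or missing data")

# Broken: "err" becomes a %-format argument, but the message has no %s,
# so logging prints "--- Logging error ---" plus
# "TypeError: not all arguments converted during string formatting".
logger.error("Failed during training loop: ", err)

# Fixed: give the argument a placeholder (or call logger.exception(...)
# inside an except block to include the traceback).
logger.error("Failed during training loop: %s", err)

This secondary bug does not cause the training failure; it only hides the underlying "Malformed or missing data" exception in the output.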
Expected behavior
- Phased training should complete successfully and produce a trained model.
Screenshots
- Attached Image
Device Info:
- Hardware Specs: reproduced on 8xL40S (AWS) and 8xA100 (IBM Cloud)
- OS Version: Red Hat Enterprise Linux 9.4 (Plow)
- InstructLab Version: 0.23.1 (see the ilab system info output below)
- Provide the output of these two commands:
- sudo bootc status --format json | jq .status.booted.image.image.image to print the name and tag of the bootc image; it should look like registry.stage.redhat.io/rhelai1/bootc-intel-rhel9:1.3-1732894187
- ilab system info to print detailed information about the InstructLab version, OS, and hardware, including GPU / AI accelerator hardware
[cloud-user@instructlab-ci-8xa100-preserve ~]$ sudo bootc status --format json | jq .status.booted.image.image.image
"registry.stage.redhat.io/rhelai1/bootc-nvidia-rhel9:1.4"
[cloud-user@instructlab-ci-8xa100-preserve cloud-user]$ ilab system info
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 CUDA devices:
Device 0: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Device 1: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Device 2: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Device 3: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Device 4: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Device 5: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Device 6: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Device 7: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Platform:
sys.version: 3.11.7 (main, Jan 8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
sys.platform: linux
os.name: posix
platform.release: 5.14.0-427.50.1.el9_4.x86_64
platform.machine: x86_64
platform.node: instructlab-ci-8xa100-preserve
platform.python_version: 3.11.7
os-release.ID: rhel
os-release.VERSION_ID: 9.4
os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
memory.total: 1259.87 GB
memory.available: 1248.68 GB
memory.used: 3.27 GB
InstructLab:
instructlab.version: 0.23.1
instructlab-dolomite.version: 0.2.0
instructlab-eval.version: 0.5.1
instructlab-quantize.version: 0.1.0
instructlab-schema.version: 0.4.2
instructlab-sdg.version: 0.7.0
instructlab-training.version: 0.7.0
Torch:
torch.version: 2.5.1
torch.backends.cpu.capability: AVX512
torch.version.cuda: 12.4
torch.version.hip: None
torch.cuda.available: True
torch.backends.cuda.is_built: True
torch.backends.mps.is_built: False
torch.backends.mps.is_available: False
torch.cuda.bf16: True
torch.cuda.current.device: 0
torch.cuda.0.name: NVIDIA A100-SXM4-80GB
torch.cuda.0.free: 78.7 GB
torch.cuda.0.total: 79.1 GB
torch.cuda.0.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.1.name: NVIDIA A100-SXM4-80GB
torch.cuda.1.free: 78.7 GB
torch.cuda.1.total: 79.1 GB
torch.cuda.1.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.2.name: NVIDIA A100-SXM4-80GB
torch.cuda.2.free: 78.7 GB
torch.cuda.2.total: 79.1 GB
torch.cuda.2.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.3.name: NVIDIA A100-SXM4-80GB
torch.cuda.3.free: 78.7 GB
torch.cuda.3.total: 79.1 GB
torch.cuda.3.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.4.name: NVIDIA A100-SXM4-80GB
torch.cuda.4.free: 78.7 GB
torch.cuda.4.total: 79.1 GB
torch.cuda.4.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.5.name: NVIDIA A100-SXM4-80GB
torch.cuda.5.free: 78.7 GB
torch.cuda.5.total: 79.1 GB
torch.cuda.5.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.6.name: NVIDIA A100-SXM4-80GB
torch.cuda.6.free: 78.7 GB
torch.cuda.6.total: 79.1 GB
torch.cuda.6.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.7.name: NVIDIA A100-SXM4-80GB
torch.cuda.7.free: 78.7 GB
torch.cuda.7.total: 79.1 GB
torch.cuda.7.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
llama_cpp_python:
llama_cpp_python.version: 0.3.2
llama_cpp_python.supports_gpu_offload: True
Bug impact
- InstructLab cannot be used to create a model: phased training fails in phase 2.
Known workaround
- None confirmed; possibly not using the mixin dataset?