Project: Red Hat Enterprise Linux AI
Issue: RHELAI-3347

RHELAI 1.4 Training Fails with load_dataset error

      To Reproduce

      Steps to reproduce the behavior: run phased training on RHEL AI 1.4. Phase 2/2 fails while loading the skills dataset:

      Training Phase 2/2...
      TrainingArgs for current phase: TrainingArgs(model_path='/mnt/4TB/.local/share/instructlab/phased/phase1/checkpoints/hf_format/samples_1525', chat_tmpl_path='/opt/app-root/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py', use_legacy_tmpl=True, data_path='/mnt/4TB/.local/share/instructlab/datasets/2025-02-10_062843/skills_train_msgs_2025-02-10T06_30_07.jsonl', ckpt_output_dir='/mnt/4TB/.local/share/instructlab/phased/phase2/checkpoints', data_output_dir='/mnt/4TB/.local/share/instructlab/internal', max_seq_len=10000, max_batch_len=60000, num_epochs=1, effective_batch_size=3840, save_samples=0, learning_rate=6e-06, warmup_steps=25, random_seed=42, use_dolomite=True, is_padding_free=False, checkpoint_at_epoch=True, accelerate_full_state_at_epoch=True, mock_data=False, mock_data_len=0, deepspeed_options=DeepSpeedOptions(cpu_offload_optimizer=False, cpu_offload_optimizer_ratio=1.0, cpu_offload_optimizer_pin_memory=False, save_samples=None), fsdp_options=FSDPOptions(cpu_offload_params=False, sharding_strategy=<ShardingStrategies.SHARD_GRAD_OP: 'SHARD_GRAD_OP'>), distributed_backend=<DistributedBackend.FSDP: 'fsdp'>, disable_flash_attn=False, lora=LoraOptions(rank=0, alpha=32, dropout=0.1, target_modules=('q_proj', 'k_proj', 'v_proj', 'o_proj'), quantize_data_type=<QuantizeDataType.NONE: None>), process_data=True, keep_last_checkpoint_only=False)
       data arguments are:
      {"data_path":"/mnt/4TB/.local/share/instructlab/datasets/2025-02-10_062843/skills_train_msgs_2025-02-10T06_30_07.jsonl","data_output_path":"/mnt/4TB/.local/share/instructlab/internal","max_seq_len":10000,"model_path":"/mnt/4TB/.local/share/instructlab/phased/phase1/checkpoints/hf_format/samples_1525","chat_tmpl_path":"/opt/app-root/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_legacy_tmpl.py","num_cpu_procs":16}
      INFO 2025-02-10 07:05:56,084 root:879: Special tokens: eos: [0], pad: [49153], bos: [49152], system: [49154], user: [49155], assistant: [49156]
      Generating train split: 344673 examples [00:02, 139674.49 examples/s]
      --- Logging error ---
      Traceback (most recent call last):
        File "/opt/app-root/lib64/python3.11/site-packages/datasets/builder.py", line 2013, in _prepare_split_single
          writer.write_table(table)
        File "/opt/app-root/lib64/python3.11/site-packages/datasets/arrow_writer.py", line 585, in write_table
          pa_table = table_cast(pa_table, self._schema)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/datasets/table.py", line 2281, in table_cast
          return cast_table_to_schema(table, schema)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/datasets/table.py", line 2240, in cast_table_to_schema
          arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/datasets/table.py", line 2240, in <listcomp>
          arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/datasets/table.py", line 1795, in wrapper
          return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/datasets/table.py", line 1795, in <listcomp>
          return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/datasets/table.py", line 2098, in cast_array_to_feature
          return array_cast(
                 ^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/datasets/table.py", line 1797, in wrapper
          return func(array, *args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/datasets/table.py", line 1948, in array_cast
          raise TypeError(f"Couldn't cast array of type {_short_str(array.type)} to {_short_str(pa_type)}")
      TypeError: Couldn't cast array of type string to null
      The above exception was the direct cause of the following exception:
      Traceback (most recent call last):
        File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/data_process.py", line 288, in main
          data = load_dataset("json", data_files=args.data_path, split="train")
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/datasets/load.py", line 2628, in load_dataset
          builder_instance.download_and_prepare(
        File "/opt/app-root/lib64/python3.11/site-packages/datasets/builder.py", line 1029, in download_and_prepare
          self._download_and_prepare(
        File "/opt/app-root/lib64/python3.11/site-packages/datasets/builder.py", line 1124, in _download_and_prepare
          self._prepare_split(split_generator, **prepare_split_kwargs)
        File "/opt/app-root/lib64/python3.11/site-packages/datasets/builder.py", line 1884, in _prepare_split
          for job_id, done, content in self._prepare_split_single(
        File "/opt/app-root/lib64/python3.11/site-packages/datasets/builder.py", line 2040, in _prepare_split_single
          raise DatasetGenerationError("An error occurred while generating the dataset") from e
      datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset
      During handling of the above exception, another exception occurred:
      Traceback (most recent call last):
        File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py", line 261, in _run_phase
          _training_phase(
        File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py", line 563, in _training_phase
          run_training(train_args=train_args, torch_args=torch_args)
        File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/__init__.py", line 36, in run_training
          return run_training(torch_args=torch_args, train_args=train_args)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 672, in run_training
          dp.main(
        File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/data_process.py", line 291, in main
          raise Exception(
      Exception: Malformed or missing data, please ensure that your dataset is not empty and correctly formatted
      During handling of the above exception, another exception occurred:
      Traceback (most recent call last):
        File "/usr/lib64/python3.11/logging/__init__.py", line 1110, in emit
          msg = self.format(record)
                ^^^^^^^^^^^^^^^^^^^
        File "/usr/lib64/python3.11/logging/__init__.py", line 953, in format
          return fmt.format(record)
                 ^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/instructlab/log.py", line 19, in format
          return super().format(record)
                 ^^^^^^^^^^^^^^^^^^^^^^
        File "/usr/lib64/python3.11/logging/__init__.py", line 687, in format
          record.message = record.getMessage()
                           ^^^^^^^^^^^^^^^^^^^
        File "/usr/lib64/python3.11/logging/__init__.py", line 377, in getMessage
          msg = msg % self.args
                ~~~~^~~~~~~~~~~
      TypeError: not all arguments converted during string formatting
      Call stack:
        File "/opt/app-root/bin/ilab", line 8, in <module>
          sys.exit(ilab())
        File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1161, in __call__
          return self.main(*args, **kwargs)
        File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1082, in main
          rv = self.invoke(ctx)
        File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1697, in invoke
          return _process_result(sub_ctx.command.invoke(sub_ctx))
        File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1697, in invoke
          return _process_result(sub_ctx.command.invoke(sub_ctx))
        File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1443, in invoke
          return ctx.invoke(self.callback, **ctx.params)
        File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 788, in invoke
          return __callback(*args, **kwargs)
        File "/opt/app-root/lib64/python3.11/site-packages/click/decorators.py", line 33, in new_func
          return f(get_current_context(), *args, **kwargs)
        File "/opt/app-root/lib64/python3.11/site-packages/instructlab/clickext.py", line 356, in wrapper
          return f(*args, **kwargs)
        File "/opt/app-root/lib64/python3.11/site-packages/instructlab/cli/model/train.py", line 469, in train
          accelerated_train.accelerated_train(
        File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py", line 202, in accelerated_train
          _run_phased_training(
        File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py", line 432, in _run_phased_training
          _run_phase(
        File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py", line 276, in _run_phase
          logger.error("Failed during training loop: ", e)
      Message: 'Failed during training loop: '
      Arguments: (Exception('Malformed or missing data, please ensure that your dataset is not empty and correctly formatted'),)
      Accelerated Training failed with 1
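
      Note: the "--- Logging error ---" block above appears to be a secondary bug that masks the real failure. The handler at accelerated_train.py line 276 passes the exception as a positional argument without a format placeholder, so the logging module's msg % args step fails while emitting the record. A minimal sketch of the broken call and two correct forms (the logger name is illustrative):

      import logging

      logging.basicConfig(level=logging.ERROR)
      logger = logging.getLogger("instructlab.model.accelerated_train")
      e = Exception("Malformed or missing data, please ensure that your dataset is not empty and correctly formatted")

      # Buggy (as in the call stack above): no %s placeholder for the extra
      # argument, so record.getMessage() raises "TypeError: not all arguments
      # converted during string formatting" and the handler prints
      # "--- Logging error ---" instead of the intended message.
      logger.error("Failed during training loop: ", e)

      # Correct: let logging interpolate the exception into the message.
      logger.error("Failed during training loop: %s", e)

      # Correct inside an except block: also records the full traceback.
      try:
          raise e
      except Exception:
          logger.exception("Failed during training loop")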
      

      Example:

      https://gitlab.com/redhat/rhel-ai/diip/-/jobs/9086908356
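
      The failing call is the plain load_dataset invocation shown in the traceback (data_process.py line 288). It can presumably be reproduced in isolation, outside of ilab, with the same arguments and the dataset path from the log above:

      from datasets import load_dataset

      # Same call the trainer makes; on the affected dataset this raises
      # DatasetGenerationError, caused by
      # "TypeError: Couldn't cast array of type string to null".
      data = load_dataset(
          "json",
          data_files="/mnt/4TB/.local/share/instructlab/datasets/2025-02-10_062843/skills_train_msgs_2025-02-10T06_30_07.jsonl",
          split="train",
      )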

      Expected behavior

      • Phased training should complete both phases without error.

      Screenshots

      • Attached Image

      Device Info

      • Hardware Specs: reproduced on both 8x NVIDIA L40S (AWS) and 8x NVIDIA A100 (IBM Cloud)
      • OS Version: Red Hat Enterprise Linux 9.4 (Plow), bootc image registry.stage.redhat.io/rhelai1/bootc-nvidia-rhel9:1.4
      • InstructLab Version: 0.23.1 (see the ilab system info output below)
      • Output of sudo bootc status --format json | jq .status.booted.image.image.image and ilab system info:

      [cloud-user@instructlab-ci-8xa100-preserve ~]$ sudo bootc status --format json | jq .status.booted.image.image.image
      "registry.stage.redhat.io/rhelai1/bootc-nvidia-rhel9:1.4"

      [cloud-user@instructlab-ci-8xa100-preserve cloud-user]$ ilab system info
      ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
      ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
      ggml_cuda_init: found 8 CUDA devices:
      Device 0: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
      Device 1: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
      Device 2: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
      Device 3: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
      Device 4: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
      Device 5: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
      Device 6: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
      Device 7: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
      Platform:
      sys.version: 3.11.7 (main, Jan 8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
      sys.platform: linux
      os.name: posix
      platform.release: 5.14.0-427.50.1.el9_4.x86_64
      platform.machine: x86_64
      platform.node: instructlab-ci-8xa100-preserve
      platform.python_version: 3.11.7
      os-release.ID: rhel
      os-release.VERSION_ID: 9.4
      os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
      memory.total: 1259.87 GB
      memory.available: 1248.68 GB
      memory.used: 3.27 GB

      InstructLab:
      instructlab.version: 0.23.1
      instructlab-dolomite.version: 0.2.0
      instructlab-eval.version: 0.5.1
      instructlab-quantize.version: 0.1.0
      instructlab-schema.version: 0.4.2
      instructlab-sdg.version: 0.7.0
      instructlab-training.version: 0.7.0

      Torch:
      torch.version: 2.5.1
      torch.backends.cpu.capability: AVX512
      torch.version.cuda: 12.4
      torch.version.hip: None
      torch.cuda.available: True
      torch.backends.cuda.is_built: True
      torch.backends.mps.is_built: False
      torch.backends.mps.is_available: False
      torch.cuda.bf16: True
      torch.cuda.current.device: 0
      torch.cuda.0.name: NVIDIA A100-SXM4-80GB
      torch.cuda.0.free: 78.7 GB
      torch.cuda.0.total: 79.1 GB
      torch.cuda.0.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
      torch.cuda.1.name: NVIDIA A100-SXM4-80GB
      torch.cuda.1.free: 78.7 GB
      torch.cuda.1.total: 79.1 GB
      torch.cuda.1.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
      torch.cuda.2.name: NVIDIA A100-SXM4-80GB
      torch.cuda.2.free: 78.7 GB
      torch.cuda.2.total: 79.1 GB
      torch.cuda.2.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
      torch.cuda.3.name: NVIDIA A100-SXM4-80GB
      torch.cuda.3.free: 78.7 GB
      torch.cuda.3.total: 79.1 GB
      torch.cuda.3.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
      torch.cuda.4.name: NVIDIA A100-SXM4-80GB
      torch.cuda.4.free: 78.7 GB
      torch.cuda.4.total: 79.1 GB
      torch.cuda.4.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
      torch.cuda.5.name: NVIDIA A100-SXM4-80GB
      torch.cuda.5.free: 78.7 GB
      torch.cuda.5.total: 79.1 GB
      torch.cuda.5.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
      torch.cuda.6.name: NVIDIA A100-SXM4-80GB
      torch.cuda.6.free: 78.7 GB
      torch.cuda.6.total: 79.1 GB
      torch.cuda.6.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
      torch.cuda.7.name: NVIDIA A100-SXM4-80GB
      torch.cuda.7.free: 78.7 GB
      torch.cuda.7.total: 79.1 GB
      torch.cuda.7.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)

      llama_cpp_python:
      llama_cpp_python.version: 0.3.2
      llama_cpp_python.supports_gpu_offload: True

      Bug impact

      InstructLab cannot be used to create a model; phased training fails before phase 2 completes.

      Known workaround

      • None confirmed; possibly avoid the mixin dataset? (See the diagnostic sketch below.)
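
      For context, "Couldn't cast array of type string to null" from the datasets JSON builder usually means pyarrow inferred a column as type null (e.g. the leading rows all carry null for that key) and then encountered string values for the same key further into the file. A hedged diagnostic sketch to locate such mixed-type keys, assuming only the JSONL path from the log:

      import json
      from collections import defaultdict

      # Scan the skills JSONL for top-level keys whose values mix null and
      # non-null Python types across rows; such a key is the likely trigger
      # of the string-to-null cast failure during schema inference.
      path = "/mnt/4TB/.local/share/instructlab/datasets/2025-02-10_062843/skills_train_msgs_2025-02-10T06_30_07.jsonl"
      types_seen = defaultdict(set)

      with open(path, encoding="utf-8") as f:
          for line in f:
              if not line.strip():
                  continue  # tolerate blank lines
              for key, value in json.loads(line).items():
                  types_seen[key].add(type(value).__name__)

      for key, kinds in sorted(types_seen.items()):
          if len(kinds) > 1:
              print(f"{key}: mixed value types {sorted(kinds)}")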

