Red Hat Enterprise Linux AI / RHELAI-4333

InstructLab training: checkpoint_at_epoch: false -> the training results are not saved


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version/s: rhelai-1.5.1
    • Component/s: InstructLab - Training

      To Reproduce

      Steps to reproduce the behavior:

      1. Initialize the default InstructLab config: ilab config init
      2. Change:
        • checkpoint_at_epoch: false in the train section (see the config excerpt after these steps)
      3. Run:
        • ilab train --data-path <path to training data>
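
      For reference, a minimal sketch of the relevant part of the config after step 2; only checkpoint_at_epoch is changed, and all other keys generated by ilab config init are omitted here:

      train:
        # disable saving a checkpoint after every epoch (the setting under test in this bug)
        checkpoint_at_epoch: false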

      After the training finishes, the training results are not saved: no new folders such as full_state or hf_format appear in the checkpoints directory.

      Attaching the log from the training.

      Expected behavior

      • The resulting model is saved in checkpoints/hf_format

      Device Info (please complete the following information):

      • Hardware Specs: x86_64, 8x NVIDIA A100
      • OS Version: RHEL AI 1.5
        • registry.redhat.io/rhelai1/bootc-nvidia-rhel9:1.5-1747337172
      • InstructLab version 
        • ilab system info:

      ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
      ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
      ggml_cuda_init: found 8 CUDA devices:
       Device 0: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
       Device 1: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
       Device 2: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
       Device 3: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
       Device 4: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
       Device 5: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
       Device 6: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
       Device 7: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
      Platform:
       sys.version: 3.11.7 (main, Jan  8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
       sys.platform: linux
       os.name: posix
       platform.release: 5.14.0-427.65.1.el9_4.x86_64
       platform.machine: x86_64
       platform.node: ip-10-31-83-254.us-east-1.compute.internal
       platform.python_version: 3.11.7
       os-release.ID: rhel
       os-release.VERSION_ID: 9.4
       os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
       memory.total: 1121.81 GB
       memory.available: 1110.19 GB
       memory.used: 4.80 GB

      InstructLab:
       instructlab.version: 0.26.1
       instructlab-dolomite.version: 0.2.0
       instructlab-eval.version: 0.5.1
       instructlab-quantize.version: 0.1.0
       instructlab-schema.version: 0.4.2
       instructlab-sdg.version: 0.8.2
       instructlab-training.version: 0.10.2

      Bug impact

      • Not able to use the trained model, since no checkpoint is saved.

      Additional context

      • We would like to train for 8 epochs, but saving a checkpoint at every epoch requires at least 1 TB of disk space, and disk space on the system is limited.
      • We tried setting the keep_last_checkpoint_only parameter to true, but it appears to be ignored; see RHELAI-4332 and the sketch below.
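
      For completeness, a sketch of the workaround we attempted, assuming keep_last_checkpoint_only also lives in the train section of the generated config (other keys omitted):

      train:
        checkpoint_at_epoch: true
        # attempted workaround: keep only the most recent checkpoint;
        # appears to be ignored in practice, see RHELAI-4332
        keep_last_checkpoint_only: true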

              Assignee: Unassigned
              Reporter: Vladimír Kadlec (rh-ee-vkadlec)