Red Hat Enterprise Linux AI / RHELAI-4333

InstructLab training: checkpoint_at_epoch: false -> the training results are not saved


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version/s: rhelai-1.5.1
    • Component/s: InstructLab - Training

      To Reproduce

      Steps to reproduce the behavior:

      1. Initialize the default InstructLab config: ilab config init
      2. Change:
        • checkpoint_at_epoch: false in the train section (see the config excerpt after these steps)
      3. Run:
        • ilab train --data-path <path to training data>
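
      For reference, a minimal sketch of the relevant part of the config after step 2; only checkpoint_at_epoch is changed, and all other keys generated by ilab config init are omitted here:

      train:
        # disable saving a checkpoint after every epoch (the setting under test in this bug)
        checkpoint_at_epoch: false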

      After the training finishes, the training results are not saved: no new folders such as full_state or hf_format appear in the checkpoints directory.

      Attaching the log from the training.

      Expected behavior

      • The resulting model is saved in checkpoints/hf_format

      Device Info (please complete the following information):

      • Hardware Specs: x86_64, 8x NVIDIA A100
      • OS Version: RHEL AI 1.5
        • registry.redhat.io/rhelai1/bootc-nvidia-rhel9:1.5-1747337172
      • InstructLab version 
        • ilab system info:

      ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
      ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
      ggml_cuda_init: found 8 CUDA devices:
       Device 0: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
       Device 1: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
       Device 2: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
       Device 3: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
       Device 4: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
       Device 5: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
       Device 6: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
       Device 7: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
      Platform:
       sys.version: 3.11.7 (main, Jan  8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
       sys.platform: linux
       os.name: posix
       platform.release: 5.14.0-427.65.1.el9_4.x86_64
       platform.machine: x86_64
       platform.node: ip-10-31-83-254.us-east-1.compute.internal
       platform.python_version: 3.11.7
       os-release.ID: rhel
       os-release.VERSION_ID: 9.4
       os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
       memory.total: 1121.81 GB
       memory.available: 1110.19 GB
       memory.used: 4.80 GB

      InstructLab:
       instructlab.version: 0.26.1
       instructlab-dolomite.version: 0.2.0
       instructlab-eval.version: 0.5.1
       instructlab-quantize.version: 0.1.0
       instructlab-schema.version: 0.4.2
       instructlab-sdg.version: 0.8.2
       instructlab-training.version: 0.10.2

      Bug impact

      • Not able to use the trained model, since no checkpoint is saved.

      Additional context

      • We would like to train for 8 epochs, but saving a checkpoint at every epoch requires at least 1 TB of disk space, and disk space on the system is limited.
      • We tried setting the keep_last_checkpoint_only parameter to true, but it appears to be ignored; see RHELAI-4332 and the sketch below.
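
      For completeness, a sketch of the workaround we attempted, assuming keep_last_checkpoint_only also lives in the train section of the generated config (other keys omitted):

      train:
        checkpoint_at_epoch: true
        # attempted workaround: keep only the most recent checkpoint;
        # appears to be ignored in practice, see RHELAI-4332
        keep_last_checkpoint_only: true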

              Assignee: Unassigned
              Reporter: Vladimír Kadlec (rh-ee-vkadlec)