Bug
Resolution: Duplicate
rhelai-1.4.1
ilab model train crashes with a GPU OOM error on an AWS p4d.24xlarge instance. Several system profiles were tried: NVIDIA A100 X8 (which matches the instance configuration), as well as others such as NVIDIA L40S X4 and NVIDIA A100 X2.
ilab 0.23.2
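For reference, a quick way to confirm the instance's GPU layout before reproducing (standard nvidia-smi query; on p4d.24xlarge this should report eight A100-SXM4-40GB cards, i.e. roughly 40 GiB per GPU, matching the capacity in the OOM message below):

nvidia-smi --query-gpu=index,name,memory.total --format=csv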
Commands run (the shuf step subsamples the generated skills dataset to 15,000 messages before multiphase training):

time ilab data generate

shuf -n 15000 .local/share/instructlab/datasets/`ls -1 .local/share/instructlab/datasets/ | head -n1`/skills_train_msgs_*.jsonl > .local/share/instructlab/datasets/`ls -1 .local/share/instructlab/datasets/ | head -n1`/skills_train_msgs_reduced.jsonl

time ilab model train -y --force-clear-phased-cache --enable-serving-output --strategy lab-multiphase --phased-phase1-data ~/.local/share/instructlab/datasets/`ls -1 ~/.local/share/instructlab/datasets/ | head -n1`/knowledge_train_msgs_*.jsonl --phased-phase2-data ~/.local/share/instructlab/datasets/`ls -1 .local/share/instructlab/datasets/ | head -n1`/skills_train_msgs_reduced.jsonl --phased-phase1-num-epochs 2 --phased-phase2-num-epochs 2
Crashed with:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.33 GiB. GPU 6 has a total capacity of 39.38 GiB of which 571.38 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 34.54 GiB is allocated by PyTorch, and 2.65 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
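A minimal sketch of the mitigation the error message itself suggests (the allocator setting is a documented PyTorch option; it is not verified to make multiphase training fit on the 40 GiB cards, only shown as the flag would be set before rerunning the same train command from above):

# suggested by the PyTorch OOM message; reduces allocator fragmentation, does not guarantee a fit
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
time ilab model train -y --force-clear-phased-cache --enable-serving-output --strategy lab-multiphase --phased-phase1-data ~/.local/share/instructlab/datasets/`ls -1 ~/.local/share/instructlab/datasets/ | head -n1`/knowledge_train_msgs_*.jsonl --phased-phase2-data ~/.local/share/instructlab/datasets/`ls -1 .local/share/instructlab/datasets/ | head -n1`/skills_train_msgs_reduced.jsonl --phased-phase1-num-epochs 2 --phased-phase2-num-epochs 2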
duplicates: RHELAI-2593 Training fails on AWS instance p4d.24xlarge with x8 A100-SXM4-40GB
Verified