Bug
Resolution: Duplicate
rhelai-1.4.1
ilab model train crashes with a GPU OOM error on an AWS p4d.24xlarge instance. Several system profiles were tried: NVIDIA A100 X8 (which matches the instance configuration), as well as others such as NVIDIA L40S X4 and NVIDIA A100 X2.
ilab 0.23.2
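For reference, a quick way to confirm the instance's GPU layout before reproducing (standard nvidia-smi query; on p4d.24xlarge this should report eight A100-SXM4-40GB cards, i.e. roughly 40 GiB per GPU, matching the capacity in the OOM message below):

nvidia-smi --query-gpu=index,name,memory.total --format=csv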
Commands run (the shuf step subsamples the generated skills dataset to 15,000 messages before multiphase training):

time ilab data generate

shuf -n 15000 .local/share/instructlab/datasets/`ls -1 .local/share/instructlab/datasets/ | head -n1`/skills_train_msgs_*.jsonl > .local/share/instructlab/datasets/`ls -1 .local/share/instructlab/datasets/ | head -n1`/skills_train_msgs_reduced.jsonl

time ilab model train -y --force-clear-phased-cache --enable-serving-output --strategy lab-multiphase --phased-phase1-data ~/.local/share/instructlab/datasets/`ls -1 ~/.local/share/instructlab/datasets/ | head -n1`/knowledge_train_msgs_*.jsonl --phased-phase2-data ~/.local/share/instructlab/datasets/`ls -1 .local/share/instructlab/datasets/ | head -n1`/skills_train_msgs_reduced.jsonl --phased-phase1-num-epochs 2 --phased-phase2-num-epochs 2
Crashed with:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.33 GiB. GPU 6 has a total capacity of 39.38 GiB of which 571.38 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 34.54 GiB is allocated by PyTorch, and 2.65 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
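A minimal sketch of the mitigation the error message itself suggests (the allocator setting is a documented PyTorch option; it is not verified to make multiphase training fit on the 40 GiB cards, only shown as the flag would be set before rerunning the same train command from above):

# suggested by the PyTorch OOM message; reduces allocator fragmentation, does not guarantee a fit
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
time ilab model train -y --force-clear-phased-cache --enable-serving-output --strategy lab-multiphase --phased-phase1-data ~/.local/share/instructlab/datasets/`ls -1 ~/.local/share/instructlab/datasets/ | head -n1`/knowledge_train_msgs_*.jsonl --phased-phase2-data ~/.local/share/instructlab/datasets/`ls -1 .local/share/instructlab/datasets/ | head -n1`/skills_train_msgs_reduced.jsonl --phased-phase1-num-epochs 2 --phased-phase2-num-epochs 2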
duplicates: RHELAI-2593 Training fails on AWS instance p4d.24xlarge with x8 A100-SXM4-40GB
Verified