  1. Red Hat Enterprise Linux AI
  2. RHELAI-3447

ilab model train crashing on AWS's p4d.24xlarge (8 * A100 40 GB GPU)

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Undefined
    • None
    • rhelai-1.4.1
    • InstructLab - Training
    • None

      ilab model train crashes with a GPU out-of-memory (OOM) error on an AWS p4d.24xlarge instance. The crash occurs with every training profile tried: NVIDIA A100 X8 (which matches the instance configuration), as well as others such as NVIDIA L40S X4 and NVIDIA A100 X2.

      Version: ilab 0.23.2

      Commands run:

      time ilab data generate
      # First dataset directory, as an absolute path so the commands work from any working directory
      DATASET_DIR="$HOME/.local/share/instructlab/datasets/$(ls -1 "$HOME/.local/share/instructlab/datasets/" | head -n1)"
      # Reduce the skills data to 15000 randomly selected samples
      shuf -n 15000 "$DATASET_DIR"/skills_train_msgs_*.jsonl > "$DATASET_DIR"/skills_train_msgs_reduced.jsonl

      time ilab model train -y --force-clear-phased-cache --enable-serving-output --strategy lab-multiphase --phased-phase1-data "$DATASET_DIR"/knowledge_train_msgs_*.jsonl --phased-phase2-data "$DATASET_DIR"/skills_train_msgs_reduced.jsonl --phased-phase1-num-epochs 2 --phased-phase2-num-epochs 2

      Crashed with:

       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.33 GiB. GPU 6 has a total capacity of 39.38 GiB of which 571.38 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 34.54 GiB is allocated by PyTorch, and 2.65 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
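
The figures in the error message can be cross-checked with a little arithmetic. The sketch below is not part of the original report: the function name oom_breakdown is hypothetical, the numbers are copied by hand from the message above, and the "other" value is simply the remainder that the message does not attribute to PyTorch (the message does not say what occupies it).

```shell
#!/bin/sh
# Hypothetical helper: break down the GPU 6 memory figures quoted in the
# torch.OutOfMemoryError message. All inputs are copied from that message.
oom_breakdown() {
    LC_ALL=C awk 'BEGIN {
        total     = 39.38     # GiB, total capacity of GPU 6
        free_mib  = 571.38    # MiB, reported as free
        allocated = 34.54     # GiB, allocated by PyTorch
        reserved  = 2.65      # GiB, reserved by PyTorch but unallocated
        requested = 1.33      # GiB, size of the failed allocation

        free    = free_mib / 1024          # convert MiB to GiB
        pytorch = allocated + reserved     # held by the PyTorch allocator
        other   = total - free - pytorch   # remainder, not explained by the message

        printf "free_gib=%.2f\n", free
        printf "pytorch_gib=%.2f\n", pytorch
        printf "other_gib=%.2f\n", other
        printf "shortfall_gib=%.2f\n", requested - free
    }'
}

oom_breakdown
```

Under these figures the failed 1.33 GiB request exceeds the free memory by roughly 0.77 GiB, yet it is smaller than the 2.65 GiB that PyTorch has reserved but not allocated, which is presumably why the message suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. Exporting that variable before rerunning ilab model train would be one way to try the hint; whether it avoids the crash on this instance is not established by this report.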

       

              Assignee: Unassigned
              Reporter: František Zatloukal (fzatlouk@redhat.com)
              Votes: 0
              Watchers: 2
