RHELAI-4357

H100 Nvidia profile failing to do training with OOM on RHEL AI 1.5


To Reproduce

      Steps to reproduce the behavior:

      1. Install RHEL AI 1.5 on IBM Cloud on an H100
      2. The install should inherit this profile (which is out of date): https://github.com/instructlab/instructlab/blob/main/src/instructlab/profiles/nvidia/h100/h100_x8.yaml#L120
      3. Run training with the defaults, or with the 45k max_batch_len recommended by Red Hat in the updated profiles sent over Slack (see the sketch after the error output below).
        1. The defaults also include a new sharding strategy, `HYBRID_SHARD`.
      4. During phase2 of training, the torch process fails with an out-of-memory error:

      ```

      [rank2]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 10.90 GiB. GPU 2 has a total capacity of 79.10 GiB of which 9.17 GiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 61.16 GiB is allocated by PyTorch, and 2.46 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

      ```
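
      For reference, a minimal sketch of the training settings in effect for the failing run, taken from the config dump under Additional context. Per the error above, only 9.17 GiB of the 79.10 GiB card was free when the 10.90 GiB allocation was attempted, with 61.16 GiB already allocated by PyTorch. The commented-out sharding key below is an assumed name for the HYBRID_SHARD default mentioned in step 3.1; it does not appear in the config dump.

      ```
      train:
        pipeline: accelerated      # accelerated (torch) training pipeline
        nproc_per_node: 8          # one rank per H100
        effective_batch_size: 128
        max_seq_len: 42000
        max_batch_len: 45000       # recommended value that hits the phase2 OOM
        # Assumed key name for the HYBRID_SHARD default noted in step 3.1 --
        # not present in the config dump, shown only for context:
        # fsdp_sharding_strategy: HYBRID_SHARD
      ```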

       

      Expected behavior

      • Training should complete with the recommended training parameters (45k max_batch_len).
      • The profiles shipped on the RHEL AI machine should be properly tuned.

       

      Device Info

      • Hardware Specs: 8xH100
      • OS Version: RHEL AI 1.5
      • InstructLab Version: 1.5
      • Output of `ilab system info` (detailed information about InstructLab version, OS, and hardware, including GPU / AI accelerator hardware):

      ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no

      ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no

      ggml_cuda_init: found 8 CUDA devices:

        Device 0: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes

        Device 1: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes

        Device 2: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes

        Device 3: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes

        Device 4: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes

        Device 5: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes

        Device 6: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes

        Device 7: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes

      Platform:

        sys.version: 3.11.7 (main, Jan  8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]

        sys.platform: linux

        os.name: posix

        platform.release: 5.14.0-427.65.1.el9_4.x86_64

        platform.machine: x86_64

        platform.node: dev-rhel-ai-training-client-h100-2

        platform.python_version: 3.11.7

        os-release.ID: rhel

        os-release.VERSION_ID: 9.4

        os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)

        memory.total: 1763.83 GB

        memory.available: 1739.22 GB

        memory.used: 12.81 GB

       

      InstructLab:

        instructlab.version: 0.26.1

        instructlab-dolomite.version: 0.2.0

        instructlab-eval.version: 0.5.1

        instructlab-quantize.version: 0.1.0

        instructlab-schema.version: 0.4.2

        instructlab-sdg.version: 0.8.2

        instructlab-training.version: 0.10.2

       

      Torch:

        torch.version: 2.6.0

        torch.backends.cpu.capability: AVX512

        torch.version.cuda: 12.4

        torch.version.hip: None

        torch.cuda.available: True

        torch.backends.cuda.is_built: True

        torch.backends.mps.is_built: False

        torch.backends.mps.is_available: False

        torch.cuda.bf16: True

        torch.cuda.current.device: 0

        torch.cuda.0.name: NVIDIA H100 80GB HBM3

        torch.cuda.0.free: 78.6 GB

        torch.cuda.0.total: 79.1 GB

        torch.cuda.0.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)

        torch.cuda.1.name: NVIDIA H100 80GB HBM3

        torch.cuda.1.free: 78.6 GB

        torch.cuda.1.total: 79.1 GB

        torch.cuda.1.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)

        torch.cuda.2.name: NVIDIA H100 80GB HBM3

        torch.cuda.2.free: 78.6 GB

        torch.cuda.2.total: 79.1 GB

        torch.cuda.2.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)

        torch.cuda.3.name: NVIDIA H100 80GB HBM3

        torch.cuda.3.free: 78.6 GB

        torch.cuda.3.total: 79.1 GB

        torch.cuda.3.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)

        torch.cuda.4.name: NVIDIA H100 80GB HBM3

        torch.cuda.4.free: 78.6 GB

        torch.cuda.4.total: 79.1 GB

        torch.cuda.4.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)

        torch.cuda.5.name: NVIDIA H100 80GB HBM3

        torch.cuda.5.free: 78.6 GB

        torch.cuda.5.total: 79.1 GB

        torch.cuda.5.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)

        torch.cuda.6.name: NVIDIA H100 80GB HBM3

        torch.cuda.6.free: 78.6 GB

        torch.cuda.6.total: 79.1 GB

        torch.cuda.6.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)

        torch.cuda.7.name: NVIDIA H100 80GB HBM3

        torch.cuda.7.free: 78.6 GB

        torch.cuda.7.total: 79.1 GB

        torch.cuda.7.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)

       

      llama_cpp_python:

        llama_cpp_python.version: 0.3.6

        llama_cpp_python.supports_gpu_offload: True

      Bug impact

      • Recommended training parameters do not work as expected. The cloud team has had to tune max_batch_len further down to 30k, which increases training time and blocks our rebase to the latest RHEL AI version.

      Known workaround

      • Setting max_batch_len to 30k allows training to complete (see the sketch below).
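
      A minimal sketch of the workaround as it would appear in the train section of config.yaml (same keys as the dump in Additional context; only max_batch_len changes):

      ```
      train:
        max_batch_len: 30000   # lowered from 45000; training completes but takes longer
      ```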

      Additional context

      • H100 profile sent by Mustafa Eyceoz over Slack:

      chat:
        context: default
        # Directory where chat logs are stored
        logs_dir: ~/.local/share/instructlab/chatlogs
        # The maximum number of tokens that can be generated in the chat completion
        max_tokens: null
        # Directory where model to be used for chatting with is stored
        model: ~/.cache/instructlab/models/granite-3.1-8b-lab-v2.1
        session: null
        # visual mode
        vi_mode: false
        # renders vertical overflow if enabled, displays ellipses otherwise
        visible_overflow: true
      evaluate:
        # Base taxonomy branch
        base_branch: null
        # Directory where the model to be evaluated is stored
        base_model: ~/.cache/instructlab/models/granite-3.1-8b-starter-v2.1
        # Taxonomy branch containing custom skills/knowledge that should be used for evaluation runs
        branch: null
        # Number of GPUs to use for running evaluation
        gpus: 8
        # MMLU benchmarking settings
        mmlu:
          # batch size for evaluation.
          # Valid values are a positive integer or 'auto' to select the largest batch size that will fit in memory
          batch_size: auto
          # number of question-answer pairs provided in the context preceding the question used for evaluation
          few_shots: 5
        # Settings to run MMLU against a branch of taxonomy containing
        # custom skills/knowledge used for training
        mmlu_branch:
          # Directory where custom MMLU tasks are stored
          tasks_dir: ~/.local/share/instructlab/datasets
        model: null
        # multi-turn benchmarking settings for skills
        mt_bench:
          # Directory where model to be used as judge is stored
          judge_model: ~/.cache/instructlab/models/prometheus-8x7b-v2-0
          max_workers: auto
          # Directory where evaluation results are stored
          output_dir: ~/.local/share/instructlab/internal/eval_data/mt_bench
        # Settings to run MT-Bench against a branch of taxonomy containing
        # custom skills/knowledge used for training
        mt_bench_branch:
          # Directory where model to be used as judge is stored
          judge_model: ~/.cache/instructlab/models/prometheus-8x7b-v2-0
          # Path to where base taxonomy is stored
          taxonomy_path: ~/.local/share/instructlab/taxonomy
      general:
        debug_level: 0
        log_level: INFO
        # The default student model to use when training
        student_model_id: 'granite-3.1-starter-v2'
      generate:
        # maximum number of words per chunk
        chunk_word_count: 1000
        # Teacher model that will be used to synthetically generate training data
        model: ~/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1
        # Number of CPU cores to use for generation
        num_cpus: 2
        # Number of Batches to send for generation on each core.
        batch_size: 256
        # Directory where generated datasets are stored
        output_dir: ~/.local/share/instructlab/datasets
        # Directory where pipeline config files are stored
        pipeline: /usr/share/instructlab/sdg/pipelines/agentic
        # Path to prompt file to be used for generation
        prompt_file: ~/.local/share/instructlab/internal/prompt.txt
        # The total number of instructions to be generated
        sdg_scale_factor: 30
        seed_file: ~/.local/share/instructlab/internal/seed_tasks.json
        # Branch of taxonomy used to calculate diff against
        taxonomy_base: empty
        # Directory where taxonomy is stored and accessed from
        taxonomy_path: ~/.local/share/instructlab/taxonomy
        # Teacher model specific settings
        teacher:
          # Serving backend to use to host the teacher model
          backend: vllm
          # Chat template to supply to the teacher model. Possible values:
          #   - Custom chat template string
          #   - Auto: Uses default for serving backend
          chat_template: 'tokenizer'
          # host and port where teacher model is being served
          host_port: 127.0.0.1:8000
          # Llamacpp serving settings
          llama_cpp:
            # number of model layers to offload to GPU
            # -1 means all
            gpu_layers: -1
            # the family of model being served - used to determine the appropriate chat template
            llm_family: ''
            # maximum number of tokens that can be processed by the model
            max_ctx_size: 4096
          # Path to teacher model that will be used to synthetically generate training data
          model_path: ~/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1
          # vLLM serving settings
          vllm:
            # number of GPUs to allocate to vLLM
            gpus: 8
            # the family of model being served - used to determine the appropriate chat template
            llm_family: 'mixtral'
            # additional arguments to be supplied directly to vLLM
            vllm_args:
              - --max-num-seqs
              - '512'
              - --enable-lora
              - --max-lora-rank
              - '64'
              - --dtype
              - bfloat16
              - --lora-dtype
              - bfloat16
              - --fully-sharded-loras
              - --lora-modules
              # Directory where LoRA adapter for skills is stored
              - skill-classifier-v3-clm=$HOME/.cache/instructlab/models/skills-adapter-v3
              # Directory where LoRA adapter for knowledge is stored
              - text-classifier-knowledge-v3-clm=$HOME/.cache/instructlab/models/knowledge-adapter-v3
      models:
        - id: llama-3.3
          family: llama
          path: ~/.cache/instructlab/models/meta-llama/Llama-3.3-70B-Instruct
        - id: granite-3.1-starter-v2
          family: granite
          path: ~/.cache/instructlab/models/granite-3.1-8b-starter-v2.1
          system_prompt: 'You are a Red Hat® Instruct Model, an AI language model developed by Red Hat and IBM Research based on the granite-3.1-8b-base model. Your primary role is to serve as a chat assistant.'
      serve:
        # Serving backend to use to host the model
        backend: vllm
        # Chat template to supply to the served model. Possible values:
        #   - Custom chat template string
        #   - Auto: Uses default for serving backend
        chat_template: auto
        # host and port where the model is being served
        host_port: 127.0.0.1:8000
        # Llamacpp serving settings
        llama_cpp:
          # number of model layers to offload to GPU
          # -1 means all
          gpu_layers: -1
          # the family of model being served - used to determine the appropriate chat template
          llm_family: ''
          # maximum number of tokens that can be processed by the model
          max_ctx_size: 4096
        # Path to model that will be served for inference
        model_path: ~/.cache/instructlab/models/granite-3.1-8b-lab-v2.1
        # vLLM serving settings
        vllm:
          gpus: 8
          # the family of model being served - used to determine the appropriate chat template
          llm_family: ''
          # additional arguments to be supplied directly to vLLM
          vllm_args: ["--tensor-parallel-size", "8"]
      train:
        additional_args:
          learning_rate: 6e-6
          lora_alpha: 32
          lora_dropout: 0.1
          warmup_steps: 25
        use_dolomite: false
        device: cuda
        pipeline: accelerated
        ckpt_output_dir: ~/.local/share/instructlab/checkpoints
        data_output_dir: ~/.local/share/instructlab/internal
        data_path: ~/.local/share/instructlab/datasets
        effective_batch_size: 128
        is_padding_free: false
        lora_quantize_dtype: null
        lora_rank: 0
        max_batch_len: 45000
        max_seq_len: 42000
        model_path: ~/.cache/instructlab/models/granite-3.1-8b-starter-v2.1
        nproc_per_node: 8
        num_epochs: 8
        phased_mt_bench_judge: ~/.cache/instructlab/models/prometheus-8x7b-v2-0
        phased_phase1_num_epochs: 7
        phased_phase2_num_epochs: 7
        phased_phase2_learning_rate: 2e-5
        checkpoint_at_epoch: true
        save_samples: 0
      metadata:
        gpu_manufacturer: Nvidia
        gpu_family: H100
        gpu_count: 8
        gpu_sku: [80GB HBM3, NVL, PCIe]
      version: 1.0.0
