- Bug
- Resolution: Cannot Reproduce
- Critical
- None
- rhelai-1.5
- False
- False
- Important
To Reproduce
Steps to reproduce the behavior:
- Install RHEL AI 1.5 on IBM Cloud on an H100 instance.
- The system should inherit this profile (which is out of date): https://github.com/instructlab/instructlab/blob/main/src/instructlab/profiles/nvidia/h100/h100_x8.yaml#L120
- Run training with the defaults, or with the 45k `max_batch_len` recommended by Red Hat in the updated profiles sent over Slack (the relevant values are excerpted below).
- The defaults also include a new sharding strategy, `HYBRID_SHARD`.
- Observe that during phase 2 of training, the torch process fails with an out-of-memory error:
```
[rank2]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 10.90 GiB. GPU 2 has a total capacity of 79.10 GiB of which 9.17 GiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 61.16 GiB is allocated by PyTorch, and 2.46 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
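For context, these are the training values in effect for the failing run, taken from the profile dump at the end of this report. This is only an excerpt, and the inline comments are my annotations, not part of the shipped profile:
```
train:
  max_batch_len: 45000        # value recommended over Slack; phase 2 OOMs with this
  effective_batch_size: 128
  max_seq_len: 42000
  nproc_per_node: 8           # one rank per H100
```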
Expected behavior
- Expect training to complete with the recommended training parameters (45k max_batch_len).
- Expect the profiles on the RHEL AI machine to be properly tuned.
Device Info:
- Hardware Specs: 8xH100
- OS Version: RHEL AI 1.5
- InstructLab Version: 1.5
- Output of `sudo bootc status --format json | jq .status.booted.image.image.image` (name and tag of the bootc image):
  - "registry.redhat.io/rhelai1/bootc-nvidia-rhel9:1.5"
- Output of `ilab system info` (InstructLab version, OS, and hardware, including GPU / AI accelerator hardware):
```
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 CUDA devices:
Device 0: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
Device 1: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
Device 2: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
Device 3: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
Device 4: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
Device 5: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
Device 6: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
Device 7: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
Platform:
sys.version: 3.11.7 (main, Jan 8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
sys.platform: linux
os.name: posix
platform.release: 5.14.0-427.65.1.el9_4.x86_64
platform.machine: x86_64
platform.node: dev-rhel-ai-training-client-h100-2
platform.python_version: 3.11.7
os-release.ID: rhel
os-release.VERSION_ID: 9.4
os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
memory.total: 1763.83 GB
memory.available: 1739.22 GB
memory.used: 12.81 GB
InstructLab:
instructlab.version: 0.26.1
instructlab-dolomite.version: 0.2.0
instructlab-eval.version: 0.5.1
instructlab-quantize.version: 0.1.0
instructlab-schema.version: 0.4.2
instructlab-sdg.version: 0.8.2
instructlab-training.version: 0.10.2
Torch:
torch.version: 2.6.0
torch.backends.cpu.capability: AVX512
torch.version.cuda: 12.4
torch.version.hip: None
torch.cuda.available: True
torch.backends.cuda.is_built: True
torch.backends.mps.is_built: False
torch.backends.mps.is_available: False
torch.cuda.bf16: True
torch.cuda.current.device: 0
torch.cuda.0.name: NVIDIA H100 80GB HBM3
torch.cuda.0.free: 78.6 GB
torch.cuda.0.total: 79.1 GB
torch.cuda.0.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.1.name: NVIDIA H100 80GB HBM3
torch.cuda.1.free: 78.6 GB
torch.cuda.1.total: 79.1 GB
torch.cuda.1.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.2.name: NVIDIA H100 80GB HBM3
torch.cuda.2.free: 78.6 GB
torch.cuda.2.total: 79.1 GB
torch.cuda.2.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.3.name: NVIDIA H100 80GB HBM3
torch.cuda.3.free: 78.6 GB
torch.cuda.3.total: 79.1 GB
torch.cuda.3.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.4.name: NVIDIA H100 80GB HBM3
torch.cuda.4.free: 78.6 GB
torch.cuda.4.total: 79.1 GB
torch.cuda.4.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.5.name: NVIDIA H100 80GB HBM3
torch.cuda.5.free: 78.6 GB
torch.cuda.5.total: 79.1 GB
torch.cuda.5.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.6.name: NVIDIA H100 80GB HBM3
torch.cuda.6.free: 78.6 GB
torch.cuda.6.total: 79.1 GB
torch.cuda.6.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.7.name: NVIDIA H100 80GB HBM3
torch.cuda.7.free: 78.6 GB
torch.cuda.7.total: 79.1 GB
torch.cuda.7.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
llama_cpp_python:
llama_cpp_python.version: 0.3.6
llama_cpp_python.supports_gpu_offload: True
```
Bug impact
- The recommended training parameters do not work as expected. The cloud team has tuned max_batch_len further down to 30k, which increases the time it takes to run training and blocks our rebase onto the latest RHEL AI version.
Known workaround
- Setting max_batch_len to 30000 allows training to complete (see the sketch below).
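A minimal sketch of the workaround, assuming the value is changed in the same train section of the profile/config shown under Additional context (where the file lives and how it is applied depends on how the profile was deployed):
```
train:
  max_batch_len: 30000   # tuned down from 45000; training then completes
```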
Additional context
- H100 profile sent by Mustafa Eyceoz in Slack:
```
chat:
  context: default
  # Directory where chat logs are stored
  logs_dir: ~/.local/share/instructlab/chatlogs
  # The maximum number of tokens that can be generated in the chat completion
  max_tokens: null
  # Directory where model to be used for chatting with is stored
  model: ~/.cache/instructlab/models/granite-3.1-8b-lab-v2.1
  session: null
  # visual mode
  vi_mode: false
  # renders vertical overflow if enabled, displays ellipses otherwise
  visible_overflow: true
evaluate:
  # Base taxonomy branch
  base_branch: null
  # Directory where the model to be evaluated is stored
  base_model: ~/.cache/instructlab/models/granite-3.1-8b-starter-v2.1
  # Taxonomy branch containing custom skills/knowledge that should be used for evaluation runs
  branch: null
  # Number of GPUs to use for running evaluation
  gpus: 8
  # MMLU benchmarking settings
  mmlu:
    # batch size for evaluation.
    # Valid values are a positive integer or 'auto' to select the largest batch size that will fit in memory
    batch_size: auto
    # number of question-answer pairs provided in the context preceding the question used for evaluation
    few_shots: 5
  # Settings to run MMLU against a branch of taxonomy containing
  # custom skills/knowledge used for training
  mmlu_branch:
    # Directory where custom MMLU tasks are stored
    tasks_dir: ~/.local/share/instructlab/datasets
  model: null
  # multi-turn benchmarking settings for skills
  mt_bench:
    # Directory where model to be used as judge is stored
    judge_model: ~/.cache/instructlab/models/prometheus-8x7b-v2-0
    max_workers: auto
    # Directory where evaluation results are stored
    output_dir: ~/.local/share/instructlab/internal/eval_data/mt_bench
  # Settings to run MT-Bench against a branch of taxonomy containing
  # custom skills/knowledge used for training
  mt_bench_branch:
    # Directory where model to be used as judge is stored
    judge_model: ~/.cache/instructlab/models/prometheus-8x7b-v2-0
    # Path to where base taxonomy is stored
    taxonomy_path: ~/.local/share/instructlab/taxonomy
general:
  debug_level: 0
  log_level: INFO
  # The default student model to use when training
  student_model_id: 'granite-3.1-starter-v2'
generate:
  # maximum number of words per chunk
  chunk_word_count: 1000
  # Teacher model that will be used to synthetically generate training data
  model: ~/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1
  # Number of CPU cores to use for generation
  num_cpus: 2
  # Number of Batches to send for generation on each core.
  batch_size: 256
  # Directory where generated datasets are stored
  output_dir: ~/.local/share/instructlab/datasets
  # Directory where pipeline config files are stored
  pipeline: /usr/share/instructlab/sdg/pipelines/agentic
  # Path to prompt file to be used for generation
  prompt_file: ~/.local/share/instructlab/internal/prompt.txt
  # The total number of instructions to be generated
  sdg_scale_factor: 30
  seed_file: ~/.local/share/instructlab/internal/seed_tasks.json
  # Branch of taxonomy used to calculate diff against
  taxonomy_base: empty
  # Directory where taxonomy is stored and accessed from
  taxonomy_path: ~/.local/share/instructlab/taxonomy
  # Teacher model specific settings
  teacher:
    # Serving backend to use to host the teacher model
    backend: vllm
    # Chat template to supply to the teacher model. Possible values:
    #   - Custom chat template string
    #   - Auto: Uses default for serving backend
    chat_template: 'tokenizer'
    # host and port where teacher model is being served
    host_port: 127.0.0.1:8000
    # Llamacpp serving settings
    llama_cpp:
      # number of model layers to offload to GPU
      # -1 means all
      gpu_layers: -1
      # the family of model being served - used to determine the appropriate chat template
      llm_family: ''
      # maximum number of tokens that can be processed by the model
      max_ctx_size: 4096
    # Path to teacher model that will be used to synthetically generate training data
    model_path: ~/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1
    # vLLM serving settings
    vllm:
      # number of GPUs to allocate to vLLM
      gpus: 8
      # the family of model being served - used to determine the appropriate chat template
      llm_family: 'mixtral'
      # additional arguments to be supplied directly to vLLM
      vllm_args:
        - --max-num-seqs
        - '512'
        - --enable-lora
        - --max-lora-rank
        - '64'
        - --dtype
        - bfloat16
        - --lora-dtype
        - bfloat16
        - --fully-sharded-loras
        - --lora-modules
        # Directory where LoRA adapter for skills is stored
        - skill-classifier-v3-clm=$HOME/.cache/instructlab/models/skills-adapter-v3
        # Directory where LoRA adapter for knowledge is stored
        - text-classifier-knowledge-v3-clm=$HOME/.cache/instructlab/models/knowledge-adapter-v3
models:
  - id: llama-3.3
    family: llama
    path: ~/.cache/instructlab/models/meta-llama/Llama-3.3-70B-Instruct
  - id: granite-3.1-starter-v2
    family: granite
    path: ~/.cache/instructlab/models/granite-3.1-8b-starter-v2.1
    system_prompt: 'You are a Red Hat® Instruct Model, an AI language model developed by Red Hat and IBM Research based on the granite-3.1-8b-base model. Your primary role is to serve as a chat assistant.'
serve:
  # Serving backend to use to host the model
  backend: vllm
  # Chat template to supply to the served model. Possible values:
  #   - Custom chat template string
  #   - Auto: Uses default for serving backend
  chat_template: auto
  # host and port where the model is being served
  host_port: 127.0.0.1:8000
  # Llamacpp serving settings
  llama_cpp:
    # number of model layers to offload to GPU
    # -1 means all
    gpu_layers: -1
    # the family of model being served - used to determine the appropriate chat template
    llm_family: ''
    # maximum number of tokens that can be processed by the model
    max_ctx_size: 4096
  # Path to model that will be served for inference
  model_path: ~/.cache/instructlab/models/granite-3.1-8b-lab-v2.1
  # vLLM serving settings
  vllm:
    gpus: 8
    # the family of model being served - used to determine the appropriate chat template
    llm_family: ''
    # additional arguments to be supplied directly to vLLM
    vllm_args: ["--tensor-parallel-size", "8"]
train:
  additional_args:
    learning_rate: 6e-6
    lora_alpha: 32
    lora_dropout: 0.1
    warmup_steps: 25
    use_dolomite: false
  device: cuda
  pipeline: accelerated
  ckpt_output_dir: ~/.local/share/instructlab/checkpoints
  data_output_dir: ~/.local/share/instructlab/internal
  data_path: ~/.local/share/instructlab/datasets
  effective_batch_size: 128
  is_padding_free: false
  lora_quantize_dtype: null
  lora_rank: 0
  max_batch_len: 45000
  max_seq_len: 42000
  model_path: ~/.cache/instructlab/models/granite-3.1-8b-starter-v2.1
  nproc_per_node: 8
  num_epochs: 8
  phased_mt_bench_judge: ~/.cache/instructlab/models/prometheus-8x7b-v2-0
  phased_phase1_num_epochs: 7
  phased_phase2_num_epochs: 7
  phased_phase2_learning_rate: 2e-5
  checkpoint_at_epoch: true
  save_samples: 0
metadata:
  gpu_manufacturer: Nvidia
  gpu_family: H100
  gpu_count: 8
  gpu_sku: [80GB HBM3, NVL, PCIe]
  version: 1.0.0
```