RHELAI-4357

H100 Nvidia profile failing to do training with OOM on RHEL AI 1.5


To Reproduce

      Steps to reproduce the behavior:

      1. Install RHEL AI 1.5 on IBM Cloud on an H100
      2. The install should inherit this profile (which is out of date): https://github.com/instructlab/instructlab/blob/main/src/instructlab/profiles/nvidia/h100/h100_x8.yaml#L120
      3. Run training with the defaults, or with the 45k max_batch_len recommended by Red Hat in the updated profiles sent over Slack (see the sketch after the error output below).
        1. The defaults also include a new sharding strategy, `HYBRID_SHARD`.
      4. During phase2 of training, the torch process fails with an out-of-memory error:

      ```

      [rank2]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 10.90 GiB. GPU 2 has a total capacity of 79.10 GiB of which 9.17 GiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 61.16 GiB is allocated by PyTorch, and 2.46 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

      ```
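
      For reference, a minimal sketch of the training settings in effect for the failing run, taken from the config dump under Additional context. Per the error above, only 9.17 GiB of the 79.10 GiB card was free when the 10.90 GiB allocation was attempted, with 61.16 GiB already allocated by PyTorch. The commented-out sharding key below is an assumed name for the HYBRID_SHARD default mentioned in step 3.1; it does not appear in the config dump.

      ```
      train:
        pipeline: accelerated      # accelerated (torch) training pipeline
        nproc_per_node: 8          # one rank per H100
        effective_batch_size: 128
        max_seq_len: 42000
        max_batch_len: 45000       # recommended value that hits the phase2 OOM
        # Assumed key name for the HYBRID_SHARD default noted in step 3.1 --
        # not present in the config dump, shown only for context:
        # fsdp_sharding_strategy: HYBRID_SHARD
      ```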

       

      Expected behavior

      • Training should complete with the recommended training parameters (45k max_batch_len).
      • The profiles shipped on the RHEL AI machine should be properly tuned.

       

      Device Info

      • Hardware Specs: 8xH100
      • OS Version: RHEL AI 1.5
      • InstructLab Version: 1.5
      • Output of `ilab system info` (detailed information about InstructLab version, OS, and hardware, including GPU / AI accelerator hardware):

      ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no

      ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no

      ggml_cuda_init: found 8 CUDA devices:

        Device 0: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes

        Device 1: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes

        Device 2: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes

        Device 3: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes

        Device 4: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes

        Device 5: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes

        Device 6: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes

        Device 7: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes

      Platform:

        sys.version: 3.11.7 (main, Jan  8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]

        sys.platform: linux

        os.name: posix

        platform.release: 5.14.0-427.65.1.el9_4.x86_64

        platform.machine: x86_64

        platform.node: dev-rhel-ai-training-client-h100-2

        platform.python_version: 3.11.7

        os-release.ID: rhel

        os-release.VERSION_ID: 9.4

        os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)

        memory.total: 1763.83 GB

        memory.available: 1739.22 GB

        memory.used: 12.81 GB

       

      InstructLab:

        instructlab.version: 0.26.1

        instructlab-dolomite.version: 0.2.0

        instructlab-eval.version: 0.5.1

        instructlab-quantize.version: 0.1.0

        instructlab-schema.version: 0.4.2

        instructlab-sdg.version: 0.8.2

        instructlab-training.version: 0.10.2

       

      Torch:

        torch.version: 2.6.0

        torch.backends.cpu.capability: AVX512

        torch.version.cuda: 12.4

        torch.version.hip: None

        torch.cuda.available: True

        torch.backends.cuda.is_built: True

        torch.backends.mps.is_built: False

        torch.backends.mps.is_available: False

        torch.cuda.bf16: True

        torch.cuda.current.device: 0

        torch.cuda.0.name: NVIDIA H100 80GB HBM3

        torch.cuda.0.free: 78.6 GB

        torch.cuda.0.total: 79.1 GB

        torch.cuda.0.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)

        torch.cuda.1.name: NVIDIA H100 80GB HBM3

        torch.cuda.1.free: 78.6 GB

        torch.cuda.1.total: 79.1 GB

        torch.cuda.1.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)

        torch.cuda.2.name: NVIDIA H100 80GB HBM3

        torch.cuda.2.free: 78.6 GB

        torch.cuda.2.total: 79.1 GB

        torch.cuda.2.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)

        torch.cuda.3.name: NVIDIA H100 80GB HBM3

        torch.cuda.3.free: 78.6 GB

        torch.cuda.3.total: 79.1 GB

        torch.cuda.3.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)

        torch.cuda.4.name: NVIDIA H100 80GB HBM3

        torch.cuda.4.free: 78.6 GB

        torch.cuda.4.total: 79.1 GB

        torch.cuda.4.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)

        torch.cuda.5.name: NVIDIA H100 80GB HBM3

        torch.cuda.5.free: 78.6 GB

        torch.cuda.5.total: 79.1 GB

        torch.cuda.5.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)

        torch.cuda.6.name: NVIDIA H100 80GB HBM3

        torch.cuda.6.free: 78.6 GB

        torch.cuda.6.total: 79.1 GB

        torch.cuda.6.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)

        torch.cuda.7.name: NVIDIA H100 80GB HBM3

        torch.cuda.7.free: 78.6 GB

        torch.cuda.7.total: 79.1 GB

        torch.cuda.7.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)

       

      llama_cpp_python:

        llama_cpp_python.version: 0.3.6

        llama_cpp_python.supports_gpu_offload: True

      Bug impact

      • Recommended training parameters do not work as expected. The cloud team has had to tune max_batch_len further down to 30k, which increases training time and blocks our rebase to the latest RHEL AI version.

      Known workaround

      • Setting max_batch_len to 30k allows training to complete (see the sketch below).
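
      A minimal sketch of the workaround as it would appear in the train section of config.yaml (same keys as the dump in Additional context; only max_batch_len changes):

      ```
      train:
        max_batch_len: 30000   # lowered from 45000; training completes but takes longer
      ```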

      Additional context

      • H100 profile sent by Mustafa Eyceoz over Slack:

      chat:
        context: default
        # Directory where chat logs are stored
        logs_dir: ~/.local/share/instructlab/chatlogs
        # The maximum number of tokens that can be generated in the chat completion
        max_tokens: null
        # Directory where model to be used for chatting with is stored
        model: ~/.cache/instructlab/models/granite-3.1-8b-lab-v2.1
        session: null
        # visual mode
        vi_mode: false
        # renders vertical overflow if enabled, displays ellipses otherwise
        visible_overflow: true
      evaluate:
        # Base taxonomy branch
        base_branch: null
        # Directory where the model to be evaluated is stored
        base_model: ~/.cache/instructlab/models/granite-3.1-8b-starter-v2.1
        # Taxonomy branch containing custom skills/knowledge that should be used for evaluation runs
        branch: null
        # Number of GPUs to use for running evaluation
        gpus: 8
        # MMLU benchmarking settings
        mmlu:
          # batch size for evaluation.
          # Valid values are a positive integer or 'auto' to select the largest batch size that will fit in memory
          batch_size: auto
          # number of question-answer pairs provided in the context preceding the question used for evaluation
          few_shots: 5
        # Settings to run MMLU against a branch of taxonomy containing
        # custom skills/knowledge used for training
        mmlu_branch:
          # Directory where custom MMLU tasks are stored
          tasks_dir: ~/.local/share/instructlab/datasets
        model: null
        # multi-turn benchmarking settings for skills
        mt_bench:
          # Directory where model to be used as judge is stored
          judge_model: ~/.cache/instructlab/models/prometheus-8x7b-v2-0
          max_workers: auto
          # Directory where evaluation results are stored
          output_dir: ~/.local/share/instructlab/internal/eval_data/mt_bench
        # Settings to run MT-Bench against a branch of taxonomy containing
        # custom skills/knowledge used for training
        mt_bench_branch:
          # Directory where model to be used as judge is stored
          judge_model: ~/.cache/instructlab/models/prometheus-8x7b-v2-0
          # Path to where base taxonomy is stored
          taxonomy_path: ~/.local/share/instructlab/taxonomy
      general:
        debug_level: 0
        log_level: INFO
        # The default student model to use when training
        student_model_id: 'granite-3.1-starter-v2'
      generate:
        # maximum number of words per chunk
        chunk_word_count: 1000
        # Teacher model that will be used to synthetically generate training data
        model: ~/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1
        # Number of CPU cores to use for generation
        num_cpus: 2
        # Number of Batches to send for generation on each core.
        batch_size: 256
        # Directory where generated datasets are stored
        output_dir: ~/.local/share/instructlab/datasets
        # Directory where pipeline config files are stored
        pipeline: /usr/share/instructlab/sdg/pipelines/agentic
        # Path to prompt file to be used for generation
        prompt_file: ~/.local/share/instructlab/internal/prompt.txt
        # The total number of instructions to be generated
        sdg_scale_factor: 30
        seed_file: ~/.local/share/instructlab/internal/seed_tasks.json
        # Branch of taxonomy used to calculate diff against
        taxonomy_base: empty
        # Directory where taxonomy is stored and accessed from
        taxonomy_path: ~/.local/share/instructlab/taxonomy
        # Teacher model specific settings
        teacher:
          # Serving backend to use to host the teacher model
          backend: vllm
          # Chat template to supply to the teacher model. Possible values:
          #   - Custom chat template string
          #   - Auto: Uses default for serving backend
          chat_template: 'tokenizer'
          # host and port where teacher model is being served
          host_port: 127.0.0.1:8000
          # Llamacpp serving settings
          llama_cpp:
            # number of model layers to offload to GPU
            # -1 means all
            gpu_layers: -1
            # the family of model being served - used to determine the appropriate chat template
            llm_family: ''
            # maximum number of tokens that can be processed by the model
            max_ctx_size: 4096
          # Path to teacher model that will be used to synthetically generate training data
          model_path: ~/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1
          # vLLM serving settings
          vllm:
            # number of GPUs to allocate to vLLM
            gpus: 8
            # the family of model being served - used to determine the appropriate chat template
            llm_family: 'mixtral'
            # additional arguments to be supplied directly to vLLM
            vllm_args:
              - --max-num-seqs
              - '512'
              - --enable-lora
              - --max-lora-rank
              - '64'
              - --dtype
              - bfloat16
              - --lora-dtype
              - bfloat16
              - --fully-sharded-loras
              - --lora-modules
              # Directory where LoRA adapter for skills is stored
              - skill-classifier-v3-clm=$HOME/.cache/instructlab/models/skills-adapter-v3
              # Directory where LoRA adapter for knowledge is stored
              - text-classifier-knowledge-v3-clm=$HOME/.cache/instructlab/models/knowledge-adapter-v3
      models:
        - id: llama-3.3
          family: llama
          path: ~/.cache/instructlab/models/meta-llama/Llama-3.3-70B-Instruct
        - id: granite-3.1-starter-v2
          family: granite
          path: ~/.cache/instructlab/models/granite-3.1-8b-starter-v2.1
          system_prompt: 'You are a Red Hat® Instruct Model, an AI language model developed by Red Hat and IBM Research based on the granite-3.1-8b-base model. Your primary role is to serve as a chat assistant.'
      serve:
        # Serving backend to use to host the model
        backend: vllm
        # Chat template to supply to the served model. Possible values:
        #   - Custom chat template string
        #   - Auto: Uses default for serving backend
        chat_template: auto
        # host and port where the model is being served
        host_port: 127.0.0.1:8000
        # Llamacpp serving settings
        llama_cpp:
          # number of model layers to offload to GPU
          # -1 means all
          gpu_layers: -1
          # the family of model being served - used to determine the appropriate chat template
          llm_family: ''
          # maximum number of tokens that can be processed by the model
          max_ctx_size: 4096
        # Path to model that will be served for inference
        model_path: ~/.cache/instructlab/models/granite-3.1-8b-lab-v2.1
        # vLLM serving settings
        vllm:
          gpus: 8
          # the family of model being served - used to determine the appropriate chat template
          llm_family: ''
          # additional arguments to be supplied directly to vLLM
          vllm_args: ["--tensor-parallel-size", "8"]
      train:
        additional_args:
          learning_rate: 6e-6
          lora_alpha: 32
          lora_dropout: 0.1
          warmup_steps: 25
        use_dolomite: false
        device: cuda
        pipeline: accelerated
        ckpt_output_dir: ~/.local/share/instructlab/checkpoints
        data_output_dir: ~/.local/share/instructlab/internal
        data_path: ~/.local/share/instructlab/datasets
        effective_batch_size: 128
        is_padding_free: false
        lora_quantize_dtype: null
        lora_rank: 0
        max_batch_len: 45000
        max_seq_len: 42000
        model_path: ~/.cache/instructlab/models/granite-3.1-8b-starter-v2.1
        nproc_per_node: 8
        num_epochs: 8
        phased_mt_bench_judge: ~/.cache/instructlab/models/prometheus-8x7b-v2-0
        phased_phase1_num_epochs: 7
        phased_phase2_num_epochs: 7
        phased_phase2_learning_rate: 2e-5
        checkpoint_at_epoch: true
        save_samples: 0
      metadata:
        gpu_manufacturer: Nvidia
        gpu_family: H100
        gpu_count: 8
        gpu_sku: [80GB HBM3, NVL, PCIe]
      version: 1.0.0
