Bug
Resolution: Unresolved
rhelai-1.3.1
Low
To Reproduce
Steps to reproduce the behavior:
- Simply attempt to run training for the first time, by running `ilab model train --enable-serving-output --strategy lab-multiphase --phased-phase1-data ~/.local/share/instructlab/datasets/knowledge_train_msgs_2024-12-17T21_14_00.jsonl --phased-phase2-data ~/skills-15k.jsonl`
- Output:
~~~~~~~~~~~~STARTING MULTI-PHASE TRAINING~~~~~~~~~~~~
Running phased training with '2' epochs.
Note: 7 epochs is the recommended amount for optimal performance.
No training journal found. Will initialize at: '/var/home/cloud-user/.local/share/instructlab/phased/journalfile.yaml'
Metadata (checkpoints, the training journal) may have been saved from a previous training run.
By default, training will resume from this metadata if it exists.
Alternatively, the metadata can be cleared, and training can start from scratch.
Would you like to START TRAINING FROM THE BEGINNING?
'y' clears metadata to start new training, 'N' tries to resume: [y/N]: y
Expected behavior
- If this is the first training attempt and no training journal exists, skip the question and do not attempt to resume.
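The expected behavior could be sketched roughly as follows. This is only an illustration and not InstructLab's actual code: the function `should_start_fresh`, its parameters, and the `ask_user` callback are hypothetical names, and it assumes the only resumable metadata is the journal file and a checkpoints directory.

```python
from pathlib import Path
from typing import Callable


def should_start_fresh(
    journal_path: Path,
    checkpoints_dir: Path,
    ask_user: Callable[[str], bool],
) -> bool:
    """Return True if training should start from the beginning.

    Hypothetical helper: the names and the metadata layout are assumptions
    made for this sketch, not InstructLab's real API.
    """
    journal_exists = journal_path.is_file()
    checkpoints_exist = checkpoints_dir.is_dir() and any(checkpoints_dir.iterdir())

    if not journal_exists and not checkpoints_exist:
        # First run: there is nothing to resume from, so skip the prompt.
        return True

    # Metadata from a previous run exists: let the user decide.
    return ask_user("Would you like to START TRAINING FROM THE BEGINNING?")
```

With a check like this, the "No training journal found" case from the output above would go straight to a fresh run, and the prompt would only appear when there is something to resume.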
Device Info (please complete the following information):
- Hardware Specs: [p5.48xlarge AWS instance]
- OS Version: [Red Hat Enterprise Linux release 9.4]
- InstructLab Version: [0.21.2]
- Provide the output of these two commands:
- sudo bootc status --format json | jq .status.booted.image.image.image
- registry.redhat.io/rhelai1/bootc-nvidia-rhel9:1.3
- ilab system info
Platform:
  sys.version: 3.11.7 (main, Oct 9 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
  sys.platform: linux
  os.name: posix
  platform.release: 5.14.0-427.42.1.el9_4.x86_64
  platform.machine: x86_64
  platform.node: ip-10-0-26-215.us-east-2.compute.internal
  platform.python_version: 3.11.7
  os-release.ID: rhel
  os-release.VERSION_ID: 9.4
  os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
  memory.total: 1999.96 GB
  memory.available: 1987.32 GB
  memory.used: 3.48 GB
InstructLab:
  instructlab.version: 0.21.2
  instructlab-dolomite.version: 0.2.0
  instructlab-eval.version: 0.4.1
  instructlab-quantize.version: 0.1.0
  instructlab-schema.version: 0.4.1
  instructlab-sdg.version: 0.6.1
  instructlab-training.version: 0.6.1
Torch:
  torch.version: 2.4.1
  torch.backends.cpu.capability: AVX2
  torch.version.cuda: 12.4
  torch.version.hip: None
  torch.cuda.available: True
  torch.backends.cuda.is_built: True
  torch.backends.mps.is_built: False
  torch.backends.mps.is_available: False
  torch.cuda.bf16: True
  torch.cuda.current.device: 0
  torch.cuda.0.name: NVIDIA H100 80GB HBM3
  torch.cuda.0.free: 78.6 GB
  torch.cuda.0.total: 79.1 GB
  torch.cuda.0.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.1.name: NVIDIA H100 80GB HBM3
  torch.cuda.1.free: 78.6 GB
  torch.cuda.1.total: 79.1 GB
  torch.cuda.1.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.2.name: NVIDIA H100 80GB HBM3
  torch.cuda.2.free: 78.6 GB
  torch.cuda.2.total: 79.1 GB
  torch.cuda.2.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.3.name: NVIDIA H100 80GB HBM3
  torch.cuda.3.free: 78.6 GB
  torch.cuda.3.total: 79.1 GB
  torch.cuda.3.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.4.name: NVIDIA H100 80GB HBM3
  torch.cuda.4.free: 78.6 GB
  torch.cuda.4.total: 79.1 GB
  torch.cuda.4.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.5.name: NVIDIA H100 80GB HBM3
  torch.cuda.5.free: 78.6 GB
  torch.cuda.5.total: 79.1 GB
  torch.cuda.5.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.6.name: NVIDIA H100 80GB HBM3
  torch.cuda.6.free: 78.6 GB
  torch.cuda.6.total: 79.1 GB
  torch.cuda.6.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.7.name: NVIDIA H100 80GB HBM3
  torch.cuda.7.free: 78.6 GB
  torch.cuda.7.total: 79.1 GB
  torch.cuda.7.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
llama_cpp_python:
  llama_cpp_python.version: 0.2.79
  llama_cpp_python.supports_gpu_offload: True