- Bug
- Resolution: Duplicate
- Undefined
- None
- rhelai-1.3.1
- None
- False
- False
To Reproduce
Steps to reproduce the behavior:
- Simply attempt to run training for the first time by running `ilab model train --enable-serving-output --strategy lab-multiphase --phased-phase1-data ~/.local/share/instructlab/datasets/knowledge_train_msgs_2024-12-17T21_14_00.jsonl --phased-phase2-data ~/skills-15k.jsonl`
- Output:
~~~~~~~~~~~~STARTING MULTI-PHASE TRAINING~~~~~~~~~~~~
Running phased training with '2' epochs.
Note: 7 epochs is the recommended amount for optimal performance.
No training journal found. Will initialize at: '/var/home/cloud-user/.local/share/instructlab/phased/journalfile.yaml'
Metadata (checkpoints, the training journal) may have been saved from a previous training run.
By default, training will resume from this metadata if it exists.
Alternatively, the metadata can be cleared, and training can start from scratch.
Would you like to START TRAINING FROM THE BEGINNING?
'y' clears metadata to start new training, 'N' tries to resume: [y/N]: y
Expected behavior
- If this is the first training run, skip the resume prompt and do not attempt to resume, since there is no metadata to resume from (see the sketch below).
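A minimal sketch of the expected first-run behavior, assuming (hypothetically) that the journal path printed in the log above is where prior metadata is looked up; this is not InstructLab's actual implementation, only an illustration of gating the prompt on the journal's existence:

```python
# Illustration only (not InstructLab's actual code): prompt for resume-vs-restart
# only when a training journal from a previous run actually exists.
from pathlib import Path

# Hypothetical journal location, copied from the log output above; the real path
# comes from the InstructLab configuration.
JOURNAL_PATH = Path("/var/home/cloud-user/.local/share/instructlab/phased/journalfile.yaml")


def start_phased_training(journal_path: Path = JOURNAL_PATH) -> None:
    if journal_path.exists():
        # Prior metadata found: ask whether to resume or start over.
        answer = input(
            "Would you like to START TRAINING FROM THE BEGINNING? "
            "'y' clears metadata to start new training, 'N' tries to resume: [y/N]: "
        )
        if answer.strip().lower() == "y":
            journal_path.unlink()  # clear old metadata and start fresh
    else:
        # First run: nothing to resume, so initialize silently and skip the prompt.
        print(f"No training journal found. Will initialize at: '{journal_path}'")
    # ... continue with phase 1 / phase 2 training ...
```

The key point is the `journal_path.exists()` check: when no journal has ever been written, both the prompt and the resume attempt are skipped.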
Device Info (please complete the following information):
- Hardware Specs: [p5.48xlarge AWS instance]
- OS Version: [Red Hat Enterprise Linux release 9.4]
- InstructLab Version: [0.21.2]
- Provide the output of these two commands:
  - `sudo bootc status --format json | jq .status.booted.image.image.image`
    - registry.redhat.io/rhelai1/bootc-nvidia-rhel9:1.3
  - `ilab system info`
    Platform:
      sys.version: 3.11.7 (main, Oct 9 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
      sys.platform: linux
      os.name: posix
      platform.release: 5.14.0-427.42.1.el9_4.x86_64
      platform.machine: x86_64
      platform.node: ip-10-0-26-215.us-east-2.compute.internal
      platform.python_version: 3.11.7
      os-release.ID: rhel
      os-release.VERSION_ID: 9.4
      os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
      memory.total: 1999.96 GB
      memory.available: 1987.32 GB
      memory.used: 3.48 GB
    InstructLab:
      instructlab.version: 0.21.2
      instructlab-dolomite.version: 0.2.0
      instructlab-eval.version: 0.4.1
      instructlab-quantize.version: 0.1.0
      instructlab-schema.version: 0.4.1
      instructlab-sdg.version: 0.6.1
      instructlab-training.version: 0.6.1
    Torch:
      torch.version: 2.4.1
      torch.backends.cpu.capability: AVX2
      torch.version.cuda: 12.4
      torch.version.hip: None
      torch.cuda.available: True
      torch.backends.cuda.is_built: True
      torch.backends.mps.is_built: False
      torch.backends.mps.is_available: False
      torch.cuda.bf16: True
      torch.cuda.current.device: 0
      torch.cuda.0.name: NVIDIA H100 80GB HBM3
      torch.cuda.0.free: 78.6 GB
      torch.cuda.0.total: 79.1 GB
      torch.cuda.0.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
      torch.cuda.1 through torch.cuda.7: identical to torch.cuda.0 (NVIDIA H100 80GB HBM3, 78.6 GB free, 79.1 GB total, capability 9.0)
    llama_cpp_python:
      llama_cpp_python.version: 0.2.79
      llama_cpp_python.supports_gpu_offload: True