Red Hat Enterprise Linux AI / RHELAI-2742

Omit the attempt to resume model training when it's the first attempt.


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • Affects Version: rhelai-1.3.1
    • Component: InstructLab - Core
    • Severity: Low

      To Reproduce
      Steps to reproduce the behavior:

      1. Attempt to run training for the first time by running `ilab model train --enable-serving-output --strategy lab-multiphase --phased-phase1-data ~/.local/share/instructlab/datasets/knowledge_train_msgs_2024-12-17T21_14_00.jsonl --phased-phase2-data ~/skills-15k.jsonl`
      2. Observe the output; a sketch of the suspected ordering follows the log:

      ~~~~~~~~~~~~STARTING MULTI-PHASE TRAINING~~~~~~~~~~~~
      Running phased training with '2' epochs.
      Note: 7 epochs is the recommended amount for optimal performance.
      No training journal found. Will initialize at: '/var/home/cloud-user/.local/share/instructlab/phased/journalfile.yaml'
      Metadata (checkpoints, the training journal) may have been saved from a previous training run.
      By default, training will resume from this metadata if it exists.
      Alternatively, the metadata can be cleared, and training can start from scratch.

      Would you like to START TRAINING FROM THE BEGINNING?
      'y' clears metadata to start new training, 'N' tries to resume:  [y/N]: y
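
      A plausible (unconfirmed) reading of this log: the journal file is created at the "Will initialize at" step, so by the time the resume check runs, the metadata already "exists" even on a first run, and the prompt fires anyway. A minimal sketch of that suspected ordering; `journal_path` and the initialization calls are hypothetical stand-ins, not the actual InstructLab code:

          from pathlib import Path

          journal_path = Path.home() / ".local/share/instructlab/phased/journalfile.yaml"

          # Step 1: the journal is initialized up front...
          if not journal_path.exists():
              print(f"No training journal found. Will initialize at: '{journal_path}'")
              journal_path.parent.mkdir(parents=True, exist_ok=True)
              journal_path.touch()  # hypothetical stand-in for journal creation

          # Step 2: ...so this later existence check is always true,
          # and the resume prompt appears even on the very first run.
          if journal_path.exists():
              input("Would you like to START TRAINING FROM THE BEGINNING? [y/N]: ")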

       

      Expected behavior

      • Skip the question and the resume attempt entirely when it's the first training attempt, i.e., when no prior training metadata exists (see the sketch below).
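
      A minimal sketch of the expected flow, assuming the fix is to record whether the journal existed before initializing it and to gate the prompt on that; every name here (`run_phased_training`, `start_fresh`, the journal handling) is hypothetical, not the InstructLab API:

          from pathlib import Path

          def run_phased_training(journal_path: Path) -> None:
              # Capture the state before any initialization happens.
              journal_existed = journal_path.exists()

              if not journal_existed:
                  # First attempt: nothing to resume, so skip the prompt
                  # and start from scratch without asking.
                  journal_path.parent.mkdir(parents=True, exist_ok=True)
                  journal_path.touch()  # hypothetical journal initialization
                  start_fresh = True
              else:
                  # Only ask when there really is prior metadata to resume.
                  answer = input("Would you like to START TRAINING FROM THE BEGINNING? [y/N]: ")
                  start_fresh = answer.strip().lower() == "y"

              print("Starting fresh." if start_fresh else "Resuming from journal.")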

       

      Device Info (please complete the following information):

      • Hardware Specs: [p5.48xlarge AWS instance]
      • OS Version: [Red Hat Enterprise Linux release 9.4]
      • InstructLab Version: [0.21.2]
      • Provide the output of these two commands:
        • sudo bootc status --format json | jq .status.booted.image.image.image
          • registry.redhat.io/rhelai1/bootc-nvidia-rhel9:1.3
        • ilab system info 
          Platform:
            sys.version: 3.11.7 (main, Oct  9 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
            sys.platform: linux        
            os.name: posix                                                
            platform.release: 5.14.0-427.42.1.el9_4.x86_64
            platform.machine: x86_64  
            platform.node: ip-10-0-26-215.us-east-2.compute.internal
            platform.python_version: 3.11.7                               
            os-release.ID: rhel                     
            os-release.VERSION_ID: 9.4
            os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
            memory.total: 1999.96 GB                                       
            memory.available: 1987.32 GB            
            memory.used: 3.48 GB      
                                       
          InstructLab:                                                     
            instructlab.version: 0.21.2             
            instructlab-dolomite.version: 0.2.0
            instructlab-eval.version: 0.4.1
            instructlab-quantize.version: 0.1.0                  
            instructlab-schema.version: 0.4.1
            instructlab-sdg.version: 0.6.1
            instructlab-training.version: 0.6.1
           
          Torch:                                                                                                                                                                                        
            torch.version: 2.4.1                                                                                                                                                                        
            torch.backends.cpu.capability: AVX2                                                                                                                                                         
            torch.version.cuda: 12.4                                                                                                                                                                    
            torch.version.hip: None                                                                                                                                                                     
            torch.cuda.available: True                                                                                                                                                                  
            torch.backends.cuda.is_built: True                                                                                                                                                          
            torch.backends.mps.is_built: False                                                                                                                                                          
            torch.backends.mps.is_available: False
            torch.cuda.bf16: True
            torch.cuda.current.device: 0
            torch.cuda.0.name: NVIDIA H100 80GB HBM3
            torch.cuda.0.free: 78.6 GB
            torch.cuda.0.total: 79.1 GB
            torch.cuda.0.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
            torch.cuda.1.name: NVIDIA H100 80GB HBM3
            torch.cuda.1.free: 78.6 GB
            torch.cuda.1.total: 79.1 GB
            torch.cuda.1.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
            torch.cuda.2.name: NVIDIA H100 80GB HBM3
            torch.cuda.2.free: 78.6 GB
            torch.cuda.2.total: 79.1 GB
            torch.cuda.2.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
            torch.cuda.3.name: NVIDIA H100 80GB HBM3
            torch.cuda.3.free: 78.6 GB
            torch.cuda.3.total: 79.1 GB
            torch.cuda.3.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
            torch.cuda.4.name: NVIDIA H100 80GB HBM3
            torch.cuda.4.free: 78.6 GB
            torch.cuda.4.total: 79.1 GB
            torch.cuda.4.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
            torch.cuda.5.name: NVIDIA H100 80GB HBM3
            torch.cuda.5.free: 78.6 GB
            torch.cuda.5.total: 79.1 GB
            torch.cuda.5.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
            torch.cuda.6.name: NVIDIA H100 80GB HBM3
            torch.cuda.6.free: 78.6 GB
            torch.cuda.6.total: 79.1 GB
            torch.cuda.6.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
            torch.cuda.7.name: NVIDIA H100 80GB HBM3
            torch.cuda.7.free: 78.6 GB
            torch.cuda.7.total: 79.1 GB
            torch.cuda.7.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)

          llama_cpp_python:
            llama_cpp_python.version: 0.2.79
            llama_cpp_python.supports_gpu_offload: True
          

      Assignee: Unassigned
      Reporter: Alexander Chuzhoy (achuzhoy@redhat.com)