RHELAI-2714

RHELAI 1.3: many warnings shown during training


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • None
    • rhelai-1.3.1
    • InstructLab - Training
    • None
    • False
    • False
    • Moderate

      To Reproduce

      Steps to reproduce the behavior:

      1. On GCP with 8x H100 GPUs, run RHELAI 1.3.1 lab-multiphase training.
      2. Redirect the logs to a file.
      3. Check the contents of the file.

      Many warnings are shown:

      /opt/app-root/lib64/python3.11/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
        with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
      /opt/app-root/lib64/python3.11/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.

      Also this one:

        warnings.warn(
      /opt/app-root/lib64/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .

      Note that no side effects were seen, and a short training run completes successfully.
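      To quantify how much of the redirected log is taken up by these messages, a small script along the following lines could count the distinct warning sites. This is only a sketch: the function name `summarize_warnings` is hypothetical, and the regular expression assumes the `path.py:LINE: Category: message` layout seen in the excerpts above.

```python
import re
from collections import Counter

# Matches lines shaped like the excerpts above, for example:
#   /path/to/module.py:295: FutureWarning: message text
# Assumption: every warning header in the log follows this layout.
WARNING_RE = re.compile(
    r"^\s*(?P<path>\S+\.py):(?P<line>\d+): (?P<category>\w*Warning): (?P<message>.*)$"
)


def summarize_warnings(lines):
    """Count occurrences of each (file, line number, category) warning site."""
    counts = Counter()
    for line in lines:
        match = WARNING_RE.match(line)
        if match:
            site = (match["path"], int(match["line"]), match["category"])
            counts[site] += 1
    return counts


def summarize_file(log_path):
    """Read a redirected training log and summarize its warning sites."""
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        return summarize_warnings(handle)
```

      Calling `summarize_file("train-output3.txt").most_common()` would then list the noisiest warning sites first, which makes it easier to see how many distinct deprecations are involved.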

      Expected behavior

      • No warnings are shown.
      • If there are warnings, they should not fill up the whole screen.
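      One possible way to reach the expected behavior is to filter the two known `FutureWarning` sources with Python's standard `warnings` module. The sketch below is an illustration, not the project's actual fix: the helper name `silence_known_deprecations` is hypothetical, the module names are inferred from the file paths in the log, and it assumes the filters can be installed inside each training process before the warnings are emitted.

```python
import warnings

# Assumption: module names inferred from the file paths in the reported log.
NOISY_MODULES = (
    r"torch\.utils\.checkpoint",
    r"torch\.distributed\.fsdp\.fully_sharded_data_parallel",
)


def silence_known_deprecations(modules=NOISY_MODULES):
    """Ignore FutureWarning raised from the listed modules only.

    Warnings from any other module, and other categories, are left untouched.
    """
    for module in modules:
        # `module` is a regular expression matched against the start of the
        # name of the module the warning is attributed to.
        warnings.filterwarnings("ignore", category=FutureWarning, module=module)
```

      Scoping the filter to specific modules keeps unrelated deprecation notices visible, so new warnings would still surface in the log.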

      Screenshots

      • Attached Image 

      Device Info (please complete the following information):

      • Hardware Specs: [e.g. Apple M2 Pro Chip, 16 GB Memory, etc.]
      • OS Version: [e.g. Mac OS 14.4.1, Fedora Linux 40]
      • InstructLab Version: [output of ilab --version]
      • Provide the output of these two commands:
        • sudo bootc status --format json | jq .status.booted.image.image.image to print the name and tag of the bootc image, should look like registry.stage.redhat.io/rhelai1/bootc-intel-rhel9:1.3-1732894187
        • ilab system info to print detailed information about InstructLab version, OS, and hardware - including GPU / AI accelerator hardware


      Attachments

      • train-output3.txt (570 kB), uploaded by Constantin Daniel Vultur

      Assignee: Unassigned
      Reporter: Constantin Daniel Vultur (cvultur@redhat.com)