Type: Bug
Resolution: Done
Priority: Undefined
Fix Version/s: rhelai-1.3.1
Affects Version/s: None
Component/s: InstructLab - Training
Labels:
- closed-upstream

Blocked: False
Blocked Reason: None
Ready: False
Git Pull Request:
https://github.com/instructlab/instructlab/pull/2745

Release Blocker:
Approved


IBM Research expert Abhishek Bhandwaldar noted that, to produce optimal results on the granite 8b starter model, the fine-tuning phases should run with different learning rates: 2e-5 for phase 1 training and 6e-6 for phase 2 skills training. Currently, ilab model train only lets you specify a single constant learning rate, which is applied across both phases.

Expected behavior

Each fine-tuning phase should run with its own learning rate on the granite 8b starter model: 2e-5 for phase 1 training and 6e-6 for phase 2 skills training, as recommended by IBM Research expert Abhishek Bhandwaldar. How much the results differ from running a constant 6e-6 across both phases is currently unknown.
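
The requested behavior can be sketched in Python as follows. This is only an illustration of per-phase learning rates with optional overrides, not the InstructLab implementation: the names PHASE_DEFAULT_LR, PhaseSettings, resolve_learning_rate, and build_phase_settings are hypothetical, and only the two learning-rate values come from this issue.

```python
from dataclasses import dataclass
from typing import Optional

# Learning rates quoted in this issue for the granite 8b starter model.
# The mapping itself is a hypothetical illustration, not InstructLab code.
PHASE_DEFAULT_LR = {1: 2e-5, 2: 6e-6}


@dataclass(frozen=True)
class PhaseSettings:
    """Hypothetical container for the settings of one training phase."""
    phase: int
    learning_rate: float


def resolve_learning_rate(phase: int, override: Optional[float] = None) -> float:
    """Return the user override if given, otherwise the per-phase default."""
    if phase not in PHASE_DEFAULT_LR:
        raise ValueError(f"unknown training phase: {phase}")
    if override is not None:
        if override <= 0:
            raise ValueError("learning rate must be positive")
        return override
    return PHASE_DEFAULT_LR[phase]


def build_phase_settings(
    phase1_lr: Optional[float] = None,
    phase2_lr: Optional[float] = None,
) -> list[PhaseSettings]:
    """Build settings for both phases, each with its own learning rate."""
    return [
        PhaseSettings(1, resolve_learning_rate(1, phase1_lr)),
        PhaseSettings(2, resolve_learning_rate(2, phase2_lr)),
    ]


if __name__ == "__main__":
    for settings in build_phase_settings():
        print(f"phase {settings.phase}: learning_rate={settings.learning_rate}")
```

Keeping the defaults per phase, while still allowing an explicit override for either one, would let existing single-rate workflows continue to work by passing the same value for both phases.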

Device Info (please complete the following information):

Hardware Specs: 8x A100 machine IBM Cloud
OS Version: RHEL AI 1.3
InstructLab Version:
ilab, version 0.21.0

Provide the output of these two commands:
- "registry.redhat.io/rhelai1/bootc-ibm-nvidia-rhel9:1.3"

Platform:

sys.version: 3.11.7 (main, Oct 9 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]

sys.platform: linux

os.name: posix

platform.release: 5.14.0-427.42.1.el9_4.x86_64

platform.machine: x86_64

platform.node: tyler-machine-boot-6

platform.python_version: 3.11.7

os-release.ID: rhel

os-release.VERSION_ID: 9.4

os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)

memory.total: 1259.87 GB

memory.available: 1196.92 GB

memory.used: 38.83 GB

InstructLab:

instructlab.version: 0.21.0

instructlab-dolomite.version: 0.2.0

instructlab-eval.version: 0.4.1

instructlab-quantize.version: 0.1.0

instructlab-schema.version: 0.4.1

instructlab-sdg.version: 0.6.1

instructlab-training.version: 0.6.1

Torch:

torch.version: 2.4.1

torch.backends.cpu.capability: AVX512

torch.version.cuda: 12.4

torch.version.hip: None

torch.cuda.available: True

torch.backends.cuda.is_built: True

torch.backends.mps.is_built: False

torch.backends.mps.is_available: False

torch.cuda.bf16: True

torch.cuda.current.device: 0

torch.cuda.0.name: NVIDIA A100-SXM4-80GB

torch.cuda.0.free: 69.5 GB

torch.cuda.0.total: 79.1 GB

torch.cuda.0.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)

torch.cuda.1.name: NVIDIA A100-SXM4-80GB

torch.cuda.1.free: 69.4 GB

torch.cuda.1.total: 79.1 GB

torch.cuda.1.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)

torch.cuda.2.name: NVIDIA A100-SXM4-80GB

torch.cuda.2.free: 69.4 GB

torch.cuda.2.total: 79.1 GB

torch.cuda.2.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)

torch.cuda.3.name: NVIDIA A100-SXM4-80GB

torch.cuda.3.free: 69.4 GB

torch.cuda.3.total: 79.1 GB

torch.cuda.3.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)

torch.cuda.4.name: NVIDIA A100-SXM4-80GB

torch.cuda.4.free: 69.4 GB

torch.cuda.4.total: 79.1 GB

torch.cuda.4.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)

torch.cuda.5.name: NVIDIA A100-SXM4-80GB

torch.cuda.5.free: 69.4 GB

torch.cuda.5.total: 79.1 GB

torch.cuda.5.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)

torch.cuda.6.name: NVIDIA A100-SXM4-80GB

torch.cuda.6.free: 69.4 GB

torch.cuda.6.total: 79.1 GB

torch.cuda.6.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)

torch.cuda.7.name: NVIDIA A100-SXM4-80GB

torch.cuda.7.free: 69.3 GB

torch.cuda.7.total: 79.1 GB

torch.cuda.7.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)

llama_cpp_python:

llama_cpp_python.version: 0.2.79

llama_cpp_python.supports_gpu_offload: True

Assignee: Mustafa Eyceoz

Reporter: Tyler Lisowski (Inactive)

Votes: 0

Watchers: 5

Created: 2024/12/04 7:55 PM

Updated: 2025/01/06 8:44 PM

Resolved: 2025/01/06 8:44 PM
