- Type: Bug
- Resolution: Done
- Priority: Major
- Affects Versions: rhelai-1.4.1, rhelai-1.4.2, rhelai-1.4.3
To Reproduce
Steps to reproduce the behavior:
- Run MMLU with the ilab-trained model on 1.4.1:
ilab model evaluate --model /mnt/4TB/.local/share/instructlab/phased/phase2/checkpoints/hf_format/samples_379162 --benchmark mmlu
Example run:
https://gitlab.com/redhat/rhel-ai/diip/-/jobs/9177408895
Failed with:
Requesting API: 1% 363/56168 [01:19<3:10:06, 4.89it/s]
WARNING 2025-02-19 08:01:10,797 lm-eval:453: Context length (3397) + continuation length (2) > max_length (2047). Left truncating context.
Requesting API: 1% 364/56168 [01:19<3:06:50, 4.98it/s]
WARNING 2025-02-19 08:01:10,990 lm-eval:453: Context length (3395) + continuation length (2) > max_length (2047). Left truncating context.
WARNING 2025-02-19 08:01:11,174 lm-eval:374: API request failed with error message: Internal Server Error. Retrying...
WARNING 2025-02-19 08:01:12,353 lm-eval:374: API request failed with error message: Internal Server Error. Retrying...
WARNING 2025-02-19 08:01:13,535 lm-eval:374: API request failed with error message: Internal Server Error. Retrying...
INFO 2025-02-19 08:01:22,117 instructlab.model.backends.vllm:494: Waiting for GPU VRAM reclamation...
ERROR 2025-02-19 08:01:29,489 instructlab.cli.model.evaluate:272: An error occurred during evaluation: 500 Server Error: Internal Server Error for url: http://127.0.0.1:39329/v1/completions
Requesting API: 1% 364/56168 [01:38<4:10:25, 3.71it/s]
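The repeated lm-eval warnings indicate each MMLU prompt is longer than the served model's max_length of 2047 tokens, so the context gets left-truncated before the request is sent. A minimal sketch of that truncation arithmetic, using the numbers from the first warning (the function name is hypothetical, not lm-eval's actual API):

    def left_truncate(context_tokens, continuation_len, max_length):
        # Reserve room for the continuation, then keep only the rightmost
        # context tokens that fit in the remaining budget.
        budget = max_length - continuation_len
        if len(context_tokens) > budget:
            return context_tokens[-budget:]
        return context_tokens

    ctx = list(range(3397))  # context length 3397, as in the first warning
    kept = left_truncate(ctx, continuation_len=2, max_length=2047)
    assert len(kept) == 2045  # 2047 - 2 tokens survive the truncation

Even after truncation the subsequent requests fail with 500s, which suggests the server-side failure may not be (or not only be) the prompt length.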
Expected behavior
- MMLU evaluation completes without server errors and reports a score for the trained model.
Screenshots
- Attached Image
Device Info (please complete the following information):
- Hardware Specs: 8xA100 in IBM Cloud
- OS Version: Red Hat Enterprise Linux 9.4 (Plow)
- InstructLab Version (output of ilab --version): ilab, version 0.23.2
- Provide the output of these two commands:
- sudo bootc status --format json | jq .status.booted.image.image.image (prints the name and tag of the bootc image):
"registry.stage.redhat.io/rhelai1/bootc-nvidia-rhel9:1.4"
- ilab system info (prints detailed information about InstructLab version, OS, and hardware, including GPU / AI accelerator hardware):
[cloud-user@instructlab-ci-8xa100-preserve cloud-user]$ ilab system info
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 CUDA devices:
Device 0: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Device 1: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Device 2: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Device 3: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Device 4: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Device 5: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Device 6: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Device 7: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Platform:
sys.version: 3.11.7 (main, Jan 8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
sys.platform: linux
os.name: posix
platform.release: 5.14.0-427.50.2.el9_4.x86_64
platform.machine: x86_64
platform.node: instructlab-ci-8xa100-preserve
platform.python_version: 3.11.7
os-release.ID: rhel
os-release.VERSION_ID: 9.4
os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
memory.total: 1259.87 GB
memory.available: 1248.21 GB
memory.used: 3.74 GB
InstructLab:
instructlab.version: 0.23.2
instructlab-dolomite.version: 0.2.0
instructlab-eval.version: 0.5.1
instructlab-quantize.version: 0.1.0
instructlab-schema.version: 0.4.2
instructlab-sdg.version: 0.7.1
instructlab-training.version: 0.7.0
Torch:
torch.version: 2.5.1
torch.backends.cpu.capability: AVX512
torch.version.cuda: 12.4
torch.version.hip: None
torch.cuda.available: True
torch.backends.cuda.is_built: True
torch.backends.mps.is_built: False
torch.backends.mps.is_available: False
torch.cuda.bf16: True
torch.cuda.current.device: 0
torch.cuda.0.name: NVIDIA A100-SXM4-80GB
torch.cuda.0.free: 78.7 GB
torch.cuda.0.total: 79.1 GB
torch.cuda.0.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.1.name: NVIDIA A100-SXM4-80GB
torch.cuda.1.free: 78.7 GB
torch.cuda.1.total: 79.1 GB
torch.cuda.1.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.2.name: NVIDIA A100-SXM4-80GB
torch.cuda.2.free: 78.7 GB
torch.cuda.2.total: 79.1 GB
torch.cuda.2.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.3.name: NVIDIA A100-SXM4-80GB
torch.cuda.3.free: 78.7 GB
torch.cuda.3.total: 79.1 GB
torch.cuda.3.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.4.name: NVIDIA A100-SXM4-80GB
torch.cuda.4.free: 78.7 GB
torch.cuda.4.total: 79.1 GB
torch.cuda.4.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.5.name: NVIDIA A100-SXM4-80GB
torch.cuda.5.free: 78.7 GB
torch.cuda.5.total: 79.1 GB
torch.cuda.5.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.6.name: NVIDIA A100-SXM4-80GB
torch.cuda.6.free: 78.7 GB
torch.cuda.6.total: 79.1 GB
torch.cuda.6.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.7.name: NVIDIA A100-SXM4-80GB
torch.cuda.7.free: 78.7 GB
torch.cuda.7.total: 79.1 GB
torch.cuda.7.capability: 8.0 (see https://developer.nvidia.com/cuda-gpus#compute)
llama_cpp_python:
llama_cpp_python.version: 0.3.2
llama_cpp_python.supports_gpu_offload: True
Bug impact
- Unable to run MMLU evaluation, so knowledge regression cannot be assessed.
Known workaround
- None known.
Additional context
- I have rerun MMLU after the failed run and the issue reproduces consistently.
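For anyone triaging, a minimal sketch of how the failing call can be replayed directly against the vLLM backend's OpenAI-compatible completions endpoint (the port, 39329 in the log above, is assigned per run, and the prompt below is only a stand-in for an over-long MMLU prompt):

    import requests

    url = "http://127.0.0.1:39329/v1/completions"  # port changes per run
    payload = {
        "model": "/mnt/4TB/.local/share/instructlab/phased/phase2/checkpoints/hf_format/samples_379162",
        "prompt": "word " * 4000,  # stand-in prompt longer than the 2047-token window
        "max_tokens": 2,
    }
    resp = requests.post(url, json=payload, timeout=60)
    print(resp.status_code, resp.text)  # failing runs return 500 Internal Server Error here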